1.3.1 Python Components for Data Science
"Data science" is about as broad a term as they come. It may be easiest to describe it by listing its more concrete components:
Data exploration & analysis.
- Included here: Pandas; NumPy; SciPy; a helping hand from Python's Standard Library.
Data visualization. A pretty self-explanatory name: taking data and turning it into something colorful.
- Included here: Matplotlib; Seaborn; Datashader; others.
Classical machine learning. Conceptually, we could define this as any supervised or unsupervised learning task that is not deep learning (see below). Scikit-learn is far and away the go-to tool for implementing classification, regression, clustering, and dimensionality reduction, while StatsModels is less actively developed but still has a number of useful features.
- Included here: Scikit-Learn; StatsModels.
Deep learning. This is a subset of machine learning that is seeing a renaissance and is commonly implemented with Keras, among other libraries. It has seen monumental improvements over the last several years, with milestones such as AlexNet in 2012, a landmark design built from consecutive convolutional layers.
- Included here: Keras, TensorFlow, and a whole host of others.
Data storage and big data frameworks. Big data is best defined as data that is either literally too large to reside on a single machine, or that can't be processed in the absence of a distributed environment. The Python bindings to Apache technologies play heavily here.
- Included here: Apache Spark; Apache Hadoop; HDFS; Dask; h5py/pytables.
Odds and ends. Includes subtopics such as natural language processing, and image manipulation with libraries such as OpenCV.
- Included here: nltk; Spacy; OpenCV/cv2; scikit-image; Cython.
Among the various components listed above, we will discuss the most important ones for data science in Python.
1.3.1.1. pandas (Python Data Analysis)
Pandas is a Python library for working with data sets. It has functions for analyzing, cleaning, exploring, and manipulating data. The name "Pandas" refers to both "panel data" and "Python data analysis"; the library was created by Wes McKinney in 2008.
Features:
· Provides a fast and efficient way to manage and explore data
· Easy data alignment and indexing
· Handling missing data is straightforward
· Cleaning up data is simple
· Input and output tools
· Multiple file formats are supported
· Merging and joining of datasets is easy
· Extensive time-series support
· Optimized performance for array-based operations
· Integrates well with visualization tools
· Supports grouping
· Supports masking data
· Supports picking unique values
· Supports mathematical operations on the data
Installation:
pip install pandas
Usage:
import pandas as pd
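A minimal sketch of several of these features in action; the column names and values below are purely illustrative:

import pandas as pd

# A small data frame with a missing value (illustrative data)
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Chennai"],
    "sales": [250, 300, None, 175],
})

df["sales"] = df["sales"].fillna(0)          # handling missing data
totals = df.groupby("city")["sales"].sum()   # grouping
print(df["city"].unique())                   # picking unique values
print(totals)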
1.3.1.2. NumPy (Numerical Python)
NumPy is a Python library adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
NumPy (Numerical Python) is an open-source core Python library for scientific computing. It is a general-purpose array- and matrix-processing package.
Python is slow at looping compared to Fortran and other compiled languages. To overcome this, NumPy pushes repetitive array operations down into precompiled, optimized code.
NumPy arrays are faster than Python's built-in list data structure, take less storage, and are more compact.
Important Features of NumPy
- High-performance N-dimensional array object: 1D arrays, 2D arrays, and multi-dimensional arrays
- Tools for integrating code written in C/C++ and similar languages
- A multidimensional container for generic data
- Support for linear algebra, Fourier transforms, and random number generation
- Broadcasting functions
- Data-type definition capability for working with varied databases
Installation:
pip install numpy
Usage:
import numpy as np
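A minimal sketch of the N-dimensional array object, vectorized arithmetic, broadcasting, and the Fourier/random helpers; the values are illustrative:

import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])                     # 2D array object
print(a * 10)                                 # vectorized arithmetic, no Python loop
print(a + np.array([10, 20, 30]))             # broadcasting a 1D row across a 2D array
print(np.fft.fft(np.array([1.0, 0.0, -1.0, 0.0])))  # Fourier transform
print(np.random.default_rng(0).random(3))     # random number capabilities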
Which is better, NumPy or Pandas?
NumPy is more memory efficient in comparison to Pandas.
It works on "N"-dimensional data structures, which gives it a clear edge over Pandas data frames.
When it comes to working in data science, many toolkits such as TensorFlow and Seaborn accept NumPy arrays directly, so NumPy data can be fed straight to models, unlike Pandas data frames.
NumPy is also relatively faster than Pandas series, since indexing Pandas data frames takes much more time.
Pandas has its own importance as a Python library, but looking at all the above advantages offered by NumPy, the conclusion here is that NumPy is better suited than Pandas for these tasks.
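In practice the two are often used together: a Pandas data frame can hand its underlying NumPy array to array-based tools. A minimal sketch, with illustrative column names and values:

import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [2.0, 4.0, 6.0]})
X = df.to_numpy()        # extract the underlying NumPy array
print(type(X))           # <class 'numpy.ndarray'>
print(X.mean(axis=0))    # NumPy operation on the extracted array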
1.3.1.3. matplotlib
Matplotlib is a low-level graph plotting library in Python that serves as a visualization utility.
Matplotlib was created by John D. Hunter.
Matplotlib is open source and we can use it freely.
Matplotlib is mostly written in Python; a few segments are written in C, Objective-C, and JavaScript for platform compatibility.
Matplotlib is one of the most powerful libraries in Python for data visualization. Data visualization is the process of translating numbers, text, or large data sets into various types of graphs such as histograms, maps, bar plots, pie charts, etc. Visualization requires some tool or technology, and Matplotlib is the library that helps to draw charts.
Matplotlib is an open-source drawing library that supports various drawing types. It is easy to generate plots, histograms, bar charts, scatter plots, and other types of charts with just a few lines of code. It is often used in web application servers, shells, and Python scripts.
[Figure: various drawing types supported by matplotlib.pyplot]
Features
· Supports 14 types of charts
· Charting is easy, fun, and child's play with matplotlib
· Supports titles, labels, gridlines, legends, colors, and other cool visual aids for charts
Installation
pip install matplotlib
Usage
import matplotlib.pyplot as plt
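A minimal sketch of a labeled line chart; the data points are illustrative:

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 150, 170]

plt.plot(months, sales, label="sales")   # line chart
plt.title("Monthly Sales")               # title
plt.xlabel("Month")                      # axis labels
plt.ylabel("Units")
plt.grid(True)                           # gridlines
plt.legend()                             # legend
plt.show()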
1.3.1.4. scikit (scikit-learn)
Scikit-learn (Sklearn) is the most useful and robust library for machine learning. It provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction. This library is built upon NumPy, SciPy, and Matplotlib.
Through scikit-learn, we can implement various machine learning models for regression, classification, and clustering, as well as statistical tools for analyzing these models.
It also provides functionality for dimensionality reduction,
feature selection, feature extraction, ensemble techniques, and inbuilt
datasets. We will be looking into these features one by one.
Scikit-learn Features
- Data Splitting
- Splitting the dataset is essential for an unbiased evaluation of prediction performance (a sketch combining several of these features appears at the end of this section).
- Linear Regression - This supervised ML model is used when the output variable is continuous and follows a linear relationship with the independent variables. It can be used to forecast sales in the coming months by analyzing the sales data for previous months.
- Logistic Regression - Logistic Regression is also a supervised algorithm, just like linear regression; the difference is that the output variable is categorical, which makes it a classification method. It can be used to predict whether a patient has heart disease or not.
- Decision Trees - A Decision Tree is a powerful tool that can be used for both classification and regression problems. It uses a tree-like model to make decisions and predict the output. It consists of roots and nodes: roots represent the decision to split, and nodes represent an output variable value. Decision trees are useful when the dependent variable does not follow a linear relationship with the independent variables, i.e., when linear regression does not give accurate results.
- The ensemble method is a technique in which multiple models are used to predict the output variable instead of a single one. The dataset is randomly divided into subsets and then passed to different models to train them. The average of all the models is considered when we predict the output. The ensemble technique is used to manage the bias-variance trade-off. There are generally two types of ensembling techniques:
- Bagging is a technique in which multiple models of the same
type are trained with random samples from the training set. The inputs to
different models are independent of each other.
- Boosting is a technique in which multiple models are trained in such a way that the input of a model depends on the output of the previous model. In boosting, the data points that were predicted incorrectly are given more weight.
- Random Forest
- Random Forest is a bagging technique in which
hundreds/thousands of decision trees are used to build the model. Random Forest
can be used for both classification and regression problems. It can be
used to classify loan applicants, identify fraudulent activity and predict
diseases.
- XGBoost
- XGBoost stands for eXtreme Gradient Boosting. It is a boosting technique that provides a high-performance implementation of gradient boosted decision trees. The main features of XGBoost are that it can handle missing data on its own, it supports regularization, and it generally gives much more accurate results than other models.
- Support Vector Machines (SVM)
- A Support Vector Machine is a supervised ML algorithm in which we plot each data item as a point in n-dimensional space, where n is the number of features in the dataset. We then perform classification by finding the hyperplane that best separates the classes. The data points closest to the hyperplane are called support vectors. SVMs can also be used for regression problems but are generally used for classification. They are used in many applications such as face detection, email classification, etc.
- Feature Extraction
- Cross validation
- Clustering
- Scaling: Standardization and Normalization
Installation:
pip install -U scikit-learn
Usage:
import sklearn as skl
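A minimal sketch pulling several of the features above together, namely data splitting, scaling, logistic regression, and cross-validation, using scikit-learn's inbuilt iris dataset (the hyperparameters are illustrative):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Data splitting: hold out 20% of the data for an unbiased evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Scaling: standardize features to zero mean and unit variance
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Logistic regression: the output variable (iris species) is categorical
model = LogisticRegression(max_iter=200).fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

# Cross-validation: mean 5-fold accuracy on the training split
print("cv accuracy:", cross_val_score(model, X_train, y_train, cv=5).mean())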