Feature Engineering
- Delete or drop rows or columns, accepting the probability of data loss.
- The following data imputation methods are popular, but with categorical values the encoding may predict poorly; a short scikit-learn sketch follows the list.
- Replacing with the Mean, Median, or Mode of the column prevents data loss but adds some bias to the feature. This is most useful for continuous numerical values.
- Replace with LOCF, i.e., Last Observation Carried Forward (or Backward). This is more useful in time series.
- Predict the missing value by interpolation or extrapolation, using regression or classification algorithms guided by statistical covariance. The prediction serves as a proxy for the true value.
- Using Naïve Bayes, KNN, or Random Forest, the new values can be estimated. Scikit-learn supports KNN-based imputation via KNNImputer, though it has no built-in Naïve Bayes or Random Forest imputers.
- Deep learning algorithms can be used to find more accurate values, but this process may be very slow on massive data sets.
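As a minimal sketch of these options, here is a hedged example with pandas and scikit-learn; the DataFrame and the column names "age" and "price" are illustrative assumptions, not data from the course.

```python
# Illustrative imputation sketch; the data and column names are assumptions.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

df = pd.DataFrame({
    "age":   [25.0, np.nan, 47.0, 31.0, np.nan],
    "price": [10.0, 10.5, np.nan, 11.2, 11.0],
})

# Mean replacement: prevents data loss but adds bias to the feature.
mean_imp = SimpleImputer(strategy="mean")
df["age_mean"] = mean_imp.fit_transform(df[["age"]]).ravel()

# LOCF (forward fill) and backward fill: most useful for time series.
df["price_locf"] = df["price"].ffill()
df["price_bocf"] = df["price"].bfill()

# KNN imputation: estimates a missing value from the most similar rows.
knn_imp = KNNImputer(n_neighbors=2)
imputed = knn_imp.fit_transform(df[["age", "price"]])
df["age_knn"], df["price_knn"] = imputed[:, 0], imputed[:, 1]

print(df)
```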
Please remember the F2F class discussions.

Feature Engineering
A feature is a useful data column that enables better decisions and better insights.
Feature engineering combines feature generation and feature selection; it also covers standardization of the generated features.
It involves transforming raw data into a data set ready for model building in data science.
It makes raw data model-ready and creates features that make insights more useful. It is the most crucial and deciding factor in making or breaking the business decision process.
Let us first discuss the various ways of generating features from raw data.
1. Encoding
2. Binning
3. Normalization
4. Standardization
5. Missing Value handling
6. Data Imputation
1. Encoding: It is the process of converting categorical values to numerical values.
a. One-Hot Encoding
i. Each unique value gets its own binary column, so if there are many unique values, dimensionality increases. The column matching a row's value is encoded as '1' and the rest as '0' (see the sketch after item b).
b. Label Encoding
i. Instead of handling strings such as male, female, and third gender, we encode them as integers, so numerical handling is easier and more robust. But in doing so we implicitly impose an ordinality on the values. We must avoid introducing unnecessary false ordinality; only columns where an ordering makes sense (or does no harm) should be label encoded, as in the sketch below.
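As a minimal sketch of both encodings, here is a hedged example; the column name "gender" and its values are illustrative assumptions. Note that scikit-learn's LabelEncoder numbers categories from 0 rather than 1.

```python
# Illustrative encoding sketch; the data and column name are assumptions.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"gender": ["male", "female", "third", "female"]})

# One-hot encoding: one binary column per unique value,
# '1' in the column that matches the row's value, '0' elsewhere.
one_hot = pd.get_dummies(df["gender"], prefix="gender", dtype=int)
print(one_hot)

# Label encoding: each category becomes an integer
# (alphabetical here: female=0, male=1, third=2).
# Beware: this imposes an ordering that is false for nominal data.
le = LabelEncoder()
df["gender_label"] = le.fit_transform(df["gender"])
print(df)
```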
2. Binning: It is the process of converting continuous numerical values to categorical values, the reverse of encoding. It is also called bucketing, since we put a range of values into a single bucket or bin using histogram techniques. The model becomes more robust, and binning can help avoid overfitting.
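As a minimal sketch of binning, here is a hedged example with pandas; the ages, bin edges, and labels are illustrative assumptions.

```python
# Illustrative binning sketch; values, edges, and labels are assumptions.
import pandas as pd

ages = pd.Series([3, 17, 25, 46, 64, 81])

# Fixed-width bins: each range becomes one bucket (category).
fixed = pd.cut(ages, bins=[0, 18, 40, 65, 100],
               labels=["child", "young", "middle", "senior"])
print(fixed)

# Quantile bins: equal-count buckets driven by the data's distribution,
# which is closer to the histogram-based bucketing described above.
quart = pd.qcut(ages, q=4, labels=["Q1", "Q2", "Q3", "Q4"])
print(quart)
```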
3. Normalization: It is a scaling technique for features. If the data range is very wide, it is pretty hard to put everything on a single graph for visualization. Taking the log can solve this, or Min-Max normalization can solve the issue to a certain extent, depending on the nature of the data set. Here we scale everything to the range 0 to 1. Normally it is done by the formula
x1 = (x - min(x)) / (max(x) - min(x))
where x1 is the new value and x is the old value.
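As a minimal sketch applying this formula, here is a hedged numpy example; the sample values are illustrative assumptions. Scikit-learn's MinMaxScaler gives the same result.

```python
# Illustrative Min-Max normalization; sample values are assumptions.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([10.0, 20.0, 35.0, 50.0])

# x1 = (x - min(x)) / (max(x) - min(x)) scales every value into [0, 1].
x1 = (x - x.min()) / (x.max() - x.min())
print(x1)  # [0.    0.25  0.625 1.   ]

# The same scaling via scikit-learn (expects a 2-D column).
x1_sk = MinMaxScaler().fit_transform(x.reshape(-1, 1)).ravel()
print(x1_sk)
```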
4. Standardization: It is also called z-score scaling, after the z of the standard normal distribution. Here the result has mean 0 and standard deviation 1; unlike Min-Max scaling, the values are not confined to a fixed range such as -1 to 1, though most land near 0. z is defined by the formula
z = (x - mean(x)) / sd(x)
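As a minimal sketch applying this formula, here is a hedged numpy example; the sample values are illustrative assumptions. numpy's default std (ddof=0) matches scikit-learn's StandardScaler.

```python
# Illustrative z-score standardization; sample values are assumptions.
import numpy as np

x = np.array([10.0, 20.0, 35.0, 50.0])

# z = (x - mean(x)) / sd(x): the result has mean 0 and std 1.
z = (x - x.mean()) / x.std()
print(z)
print(z.mean(), z.std())  # ~0.0 and 1.0
```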
5. Handling Missing Values: Missing means not stored: intentionally not observed, lost to human error, not recorded, or not observed for some other reason. Missing entries may appear as blank, NULL, or NaN. There are 3 types of missing values (MV), listed below; a short pandas sketch for spotting them follows the list.
a. MCAR: Missing Completely At Random, e.g., due to human error.
b. MNAR: Missing Not At Random.
c. MAR: Missing At Random.
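As a minimal sketch of spotting these missing values, here is a hedged pandas example; the columns and data are illustrative assumptions. Note that deciding whether a given NaN is MCAR, MNAR, or MAR requires domain knowledge, not code.

```python
# Illustrative missing-value inspection; data and columns are assumptions.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25.0, np.nan, 47.0, None],
    "income": [50000, 62000, np.nan, 58000],
})

# Blank, NULL, and NaN entries all surface as NaN in pandas.
print(df.isna())         # boolean mask of missing cells
print(df.isna().sum())   # missing count per column
print(df.isna().mean())  # missing fraction per column
```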