Wednesday, 4 May 2022

DS#8 Feature Engineering

Feature Engineering

  1. Delete or drop the affected rows or columns, at the risk of losing data.
  2. The data imputation methods below are popular, though they predict poorly for encoded categorical values (see the sketch after this list).
  3. Replace with the mean, median, or mode of the column. This prevents data loss but adds some bias to the feature. It is most useful for continuous numerical values.
  4. Replace with LOCF, i.e., Last Observation Carried Forward (or its backward counterpart). This is more useful in time series.
  5. Predict the value by interpolation or extrapolation, using regression or classification algorithms guided by statistical measures such as covariance. The prediction serves as a proxy for the true value.
  6. Use Naïve Bayes, KNN, or Random Forest to predict the new values. scikit-learn has no dedicated imputer for most of these models, although KNNImputer and the experimental IterativeImputer cover the KNN and regression-based cases.
  7. Use deep learning algorithms to find more accurate values. This process may be very slow on massive data sets.
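
A minimal sketch of a few of these methods, assuming a hypothetical pandas DataFrame with numeric columns named age and price:

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical data with missing values
df = pd.DataFrame({
    "age":   [25.0, None, 31.0, 40.0, None],
    "price": [10.0, 10.5, None, 11.2, 11.0],
})

# Mean imputation: prevents data loss but adds some bias
df["age_mean"] = df["age"].fillna(df["age"].mean())

# LOCF: carry the last observed value forward (useful for time series)
df["price_locf"] = df["price"].ffill()

# KNN imputation: each missing value is estimated from the
# k nearest rows in feature space
imputer = KNNImputer(n_neighbors=2)
df[["age", "price"]] = imputer.fit_transform(df[["age", "price"]])
```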
Please remember F2F class discussions.

Feature Engineering

A feature is a useful data column that enables better decisions and better insights.

Feature engineering combines feature generation and feature selection. It also covers standardization of features.

It involves transforming raw data into a data set ready for model building in data science.

It makes raw data model-ready and creates features that make insights more useful. It is often the make-or-break factor in the business decision process.

Let us first discuss the various ways of generating features from raw data:

1.  Encoding

2.  Binning

3.  Normalization

4.  Standardization

5.  Missing Value handling

6.  Data Imputation

1.  Encoding: the process of converting categorical values to numerical values.

a.   One Hot Encoding

        i.  Each unique value becomes its own binary column, encoded as 1 for the matching category and 0 otherwise. If a column has many unique values, dimensionality increases sharply.

b.  Label Encoding

        i.  Instead of handling 'male', 'female', 'third gender' as strings, we encode them as 1, 2, 3; numerical handling is easier and more robust. But this quietly imposes an order on the values, so we must avoid introducing false ordinality: only columns where such an order is harmless or meaningful should be label encoded. A sketch of both encodings follows below.
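
A minimal sketch of both encodings, using a hypothetical gender column with pandas and scikit-learn:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"gender": ["male", "female", "third_gender", "female"]})

# One-hot encoding: one binary column per unique value,
# so dimensionality grows with the number of unique values
one_hot = pd.get_dummies(df["gender"], prefix="gender")

# Label encoding: each unique value becomes a single integer;
# beware the false ordinality this can impose
le = LabelEncoder()
df["gender_code"] = le.fit_transform(df["gender"])
```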

2.  Binning: the process of converting continuous numerical values to categorical values, the reverse of encoding. It is also called bucketing, since a whole range of values goes into a single bucket or bin, as in a histogram. The model becomes more robust and the risk of overfitting is reduced, at the cost of some information. A sketch follows below.
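
A minimal sketch with pandas, using hypothetical age values and hand-picked bin edges:

```python
import pandas as pd

ages = pd.Series([3, 17, 25, 42, 61, 78])

# Fixed-edge bins: every value in a range falls into the same bucket
bins = [0, 18, 35, 60, 100]
labels = ["child", "young", "middle_aged", "senior"]
age_group = pd.cut(ages, bins=bins, labels=labels)

# Quantile bins: equal counts per bucket, edges chosen from the data
age_quartile = pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"])
```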

3.  Normalization: a scaling technique for features. When the data range is very wide, it is hard to show everything on a single graph for visualization. Taking the logarithm can help, or min-max normalization can solve the issue to a certain extent, depending on the nature of the data set. Min-max normalization scales everything to the range 0 to 1 using the formula

x1 = (x - min(x)) / (max(x) - min(x))

where x1 is the new value and x is the old value, as shown in the sketch below.
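
A minimal sketch of the min-max formula with NumPy (scikit-learn's MinMaxScaler does the same for whole feature matrices); the values are hypothetical:

```python
import numpy as np

x = np.array([10.0, 20.0, 35.0, 50.0, 100.0])

# Min-max normalization: every value is scaled into [0, 1]
x1 = (x - x.min()) / (x.max() - x.min())
print(x1)  # [0.    0.111 0.278 0.444 1.   ] (rounded)
```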

 

 4.  Standardization: also called z-score scaling, where z stands for the standard score. It follows the standard normal distribution: the scaled feature has mean 0 and standard deviation 1 (not a fixed range). z is defined by the formula

                        z = (x - mean(x)) / sd(x)
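
A minimal sketch of z-score scaling with NumPy (scikit-learn's StandardScaler is the equivalent for feature matrices); the values are hypothetical:

```python
import numpy as np

x = np.array([10.0, 20.0, 35.0, 50.0, 100.0])

# z-score: subtract the mean, divide by the standard deviation;
# the result has mean 0 and standard deviation 1
z = (x - x.mean()) / x.std()
print(z.mean(), z.std())  # ~0.0, 1.0
```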

5.  Handling Missing Values: a value can be missing because it was not stored, intentionally not observed, lost to human error, or simply not recorded. Missing values may appear as blanks, NULL, or NaN. There are three types of missing values (MV); a sketch of how to spot them follows after the list:

    a.  MCAR: Missing Completely At Random, e.g. due to human error; the missingness is unrelated to the data.

    b.  MNAR: Missing Not At Random; the missingness depends on the unobserved value itself.

    c.  MAR: Missing At Random; the missingness depends on other observed variables.
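
A minimal sketch of spotting missing values with pandas before choosing an imputation strategy (column names are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 31, np.nan],
    "income": [50000, 62000, np.nan, 48000],
})

# Count and fraction of missing values per column;
# the fraction helps decide between dropping and imputing
print(df.isna().sum())
print(df.isna().mean())
```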

Click for the Kaggle FE tutorial.
Click for the Towards Data Science FE techniques article.



