Tuesday, 10 May 2022

Python : W.Y.S-W.Y.U Videos

(What You See What You Understand No Voice👪 )

See and Do Play List

Feature selection

What ?

Feature selection, the selection of refined input variables, is a very important step in finding the variables that are most predictive of a given outcome.

"Variable selection is a problem of selecting the subset of features such that accuracy of the induced classifier is maximal."

Why ?

To reduce the cost or risk associated with observing variables.
To increase predictive power
To reduce the size of models, so they are easier to trust and understand
To understand the domain

Sample Problem:

Let M be a metric that scores a model and a feature subset according to the predictions made and the features used.

Let A be the learning algorithm used to build the model.

FSS problem 1: Select a feature subset S that maximizes the score that M gives to the model learned by A using the features in S.

FSS problem 2: Select a feature subset S and a learner A' that maximize the score that M gives to the model learned by A' using the features in S.

Example: M is accuracy plus a preference for smaller models, and A is an SVM.

Find the minimal feature subset that maximizes the accuracy of an SVM.

Other possibilities for M: calibrated accuracy, AUC, or a trade-off between accuracy and the cost of features.
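
A minimal sketch of FSS problem 1, assuming scikit-learn is installed. The iris data set, the linear kernel, 5-fold cross-validation, the helper names score_subset and best_subset, and the penalty of 0.01 per feature are all illustrative assumptions and not part of the original notes. Exhaustive search like this is only practical for a handful of features.

```python
from itertools import combinations

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC


def score_subset(X, y, subset, penalty=0.01):
    """M: mean cross-validated accuracy minus a small penalty per feature used."""
    accuracy = cross_val_score(SVC(kernel="linear"), X[:, subset], y, cv=5).mean()
    return accuracy - penalty * len(subset)


def best_subset(X, y, penalty=0.01):
    """Score every non-empty feature subset with M and return the best one."""
    n_features = X.shape[1]
    candidates = [
        list(subset)
        for size in range(1, n_features + 1)
        for subset in combinations(range(n_features), size)
    ]
    return max(candidates, key=lambda s: score_subset(X, y, s, penalty))


X, y = load_iris(return_X_y=True)   # A is an SVM, the data set is an assumption
selected = best_subset(X, y)
print("Selected feature indices:", selected)
```

Raising the penalty pushes the search towards smaller subsets, which is the "preference for smaller models" part of M.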

Ref : Find the importance of feature:

Methods

Filters
Wrappers
Intrinsic
Hybrid

Selection of features based on the types of input and output (target) data.

As we know, data can be numerical integer, numerical float, categorical nominal, categorical ordinal, or categorical dichotomous.

The combinations of input and output types, and the methods suited to each, are listed below.

Numerical input and numerical output: Pearson's correlation coefficient (linear), Spearman's rank correlation (non-linear)
Numerical input and categorical output: ANOVA correlation coefficient (linear), Kendall's rank correlation (non-linear)
Categorical input and categorical output: Chi-squared test (contingency tables), mutual information
Categorical input and numerical output: the same methods as numerical input and categorical output, applied in reverse
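
To see the difference between a linear and a rank-based measure, here is a small sketch using pandas. The data are made up for illustration. Spearman's coefficient is computed here as the Pearson correlation of the ranks, which is its definition, so no extra library is assumed.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.arange(1, 11), dtype=float)
y = x ** 3                                  # monotonic but clearly non-linear

pearson = x.corr(y)                         # linear association
spearman = x.rank().corr(y.rank())          # Pearson correlation of the ranks

print("Pearson :", round(pearson, 3))
print("Spearman:", round(spearman, 3))
```

Because y always increases with x, the rank correlation is perfect, while the linear coefficient is lower.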

To select the top variables, we can use the scikit-learn library and its SelectKBest() and SelectPercentile() classes.
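
A short sketch of both selectors for the numerical input, categorical output case. The synthetic data from make_classification, the choice of f_classif (ANOVA F-test) as the scoring function, and the values k=3 and percentile=20 are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

# Synthetic data: 10 numerical features, of which 3 carry signal
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# Keep the 3 features with the highest ANOVA F-scores
top_k = SelectKBest(score_func=f_classif, k=3)
X_top_k = top_k.fit_transform(X, y)

# Keep the top 20 percent of features by the same score
top_pct = SelectPercentile(score_func=f_classif, percentile=20)
X_top_pct = top_pct.fit_transform(X, y)

print("SelectKBest shape     :", X_top_k.shape)
print("Selected column index :", top_k.get_support(indices=True))
print("SelectPercentile shape:", X_top_pct.shape)
```

For categorical inputs the scoring function would be swapped, for example to chi2 or mutual_info_classif from the same module.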

Ref : https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/

Customer churn refers to the loss of existing customers, which incurs a heavy loss for any business.

References :

Feature selection : General Customer Churn case Study

Churn Prediction : Churn Analysis ,

Churn Prediction : Commercial use of Data Science:

Formula Ref for Precision & Recall:
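
For quick reference, precision = TP / (TP + FP) and recall = TP / (TP + FN), where TP, FP and FN are the counts of true positives, false positives and false negatives. A small sketch in plain Python follows. The function name precision_recall and the churn labels are made up for illustration.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# 1 = customer churned, 0 = customer stayed (hypothetical labels)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```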

Sunday, 8 May 2022

You Tube Video Links for Teachers

Moodle for Teachers/Content Creators

#01 Create a Category, Course, User and add Resource

#02 Create user, edit user profile and update user

#05 H5P Quiz -MCQ Creation and Usage

#06 H5P Drag the Right Words Activity Creation and Usage

#07 H5P Quiz - Division Creation and Usage

#08 Moodle Interactive video creation and usage

Moodle Videos for Examiner-Teachers/COE

#09 Quiz with Proctoring👮👮 Plugin Installation and Usage

#10 Moodle Virtual Programming Lab plugin installation with a python sample demo

Wednesday, 4 May 2022

DS#8 feature Engineering

Feature Engineering

Deleting or dropping rows or columns that contain missing values is the simplest option, but it carries a risk of data loss.

The following data imputation methods are popular, although they tend to predict poorly when applied to encoded categorical values.

Replace with the mean, median or mode of the column. This prevents data loss but adds some bias to the feature. It is most useful for continuous numerical values.

Replace with LOCF, i.e. Last Observation Carried Forward, or its backward counterpart NOCB (Next Observation Carried Backward). These are most useful in time series.

Predict the missing value by interpolation or extrapolation, using regression or classification algorithms together with the statistical finding of covariance. The predicted value acts as a proxy for the true value.

Use Naïve Bayes, KNN or Random Forest to estimate the new values. scikit-learn supports KNN-based imputation through KNNImputer and model-based imputation through its experimental IterativeImputer.

Use deep learning algorithms to find more accurate values. This process may be very slow on very large data sets.
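
The simpler imputation methods above can be sketched with pandas as follows. The small temperature series is made-up data, and the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with two missing values
df = pd.DataFrame({
    "day": [1, 2, 3, 4, 5],
    "temp": [20.0, np.nan, 24.0, np.nan, 30.0],
})

mean_filled = df["temp"].fillna(df["temp"].mean())      # replace with the mean
median_filled = df["temp"].fillna(df["temp"].median())  # replace with the median
carried_forward = df["temp"].ffill()                    # last observation carried forward
carried_backward = df["temp"].bfill()                   # next observation carried backward
interpolated = df["temp"].interpolate(method="linear")  # linear interpolation

print(pd.DataFrame({
    "mean": mean_filled,
    "median": median_filled,
    "ffill": carried_forward,
    "bfill": carried_backward,
    "interpolate": interpolated,
}))
```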

Please remember F2F class discussions.

Feature Engineering

A feature is a column of data that is useful for better decisions and better insights.

Feature engineering combines feature generation and feature selection.

Feature engineering also covers standardization and generation.

It involves transforming raw data into a data set suitable for model building in data science.

It makes raw data model-ready and creates features that make insights more useful. It is often the most crucial factor in making or breaking a business decision process.

Let us first discuss the various ways of generating features from raw data.

1. Encoding

2. Binning

3. Normalization

4. Standardization

5. Missing Value handling

6. Data Imputation

1. Encoding : It is the process of converting categorical values to numerical.

a. One Hot Encoding

i. If a column has many unique values, one-hot encoding greatly increases dimensionality. One workaround is to encode only the most frequent values as '1'.

b. Label Encoding

i. Instead of handling male, female and third gender as text, we encode them as 1, 2, 3. Numerical handling is easier and more robust. But in doing so we impose an ordering on the values that does not really exist. We have to avoid such false ordinality, so label encoding is best kept for categories that have a genuine order.
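
A small sketch of both encodings with pandas. The column names, the category values and the size ordering are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["male", "female", "other", "female"],   # nominal: no natural order
    "size": ["small", "large", "medium", "small"],     # ordinal: small < medium < large
})

# One-hot encoding: one 0/1 column per unique value, no false ordering
one_hot = pd.get_dummies(df["gender"], prefix="gender")

# Label encoding for a genuinely ordered column, with the order stated explicitly
size_order = ["small", "medium", "large"]
df["size_code"] = pd.Categorical(df["size"], categories=size_order, ordered=True).codes

print(one_hot)
print(df[["size", "size_code"]])
```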

2. Binning : It is the process of converting continuous numerical values to categorical values, the reverse of encoding. It is also called bucketing, since we put a certain range of values into a single bucket or bin, using histogram techniques. Binning can make the model more robust and helps guard against overfitting.
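
A minimal binning sketch with pandas.cut. The ages, bin edges and labels are assumptions chosen only to illustrate the idea.

```python
import pandas as pd

ages = pd.Series([5, 17, 25, 42, 67, 80])

# Each bin includes its right edge: (0, 12], (12, 19], (19, 64], (64, 120]
bins = [0, 12, 19, 64, 120]
labels = ["child", "teen", "adult", "senior"]

age_group = pd.cut(ages, bins=bins, labels=labels)
print(age_group.value_counts().sort_index())
```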

3. Normalization: It is a scaling technique for features. If the data range is very wide, it is hard to show the data in a single graph. Taking the log of the values can solve this. Min-max normalization can also solve the issue to a certain extent, depending on the nature of the data set. Here we scale everything to the range 0 to 1, using the formula

x1 = (x - min(x)) / (max(x) - min(x))

where x1 is the new value and x is the old value.
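
A small sketch of the min-max formula with NumPy, together with the log option mentioned above. The helper name min_max_scale and the sample values are assumptions for illustration.

```python
import numpy as np


def min_max_scale(x):
    """Scale values to the range 0 to 1 using (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:                      # all values equal: avoid division by zero
        return np.zeros_like(x)
    return (x - x.min()) / span


wide = [10, 100, 1000, 100000]         # very wide range
logged = np.log10(wide)                # log compresses the range

print(min_max_scale(wide))
print(min_max_scale(logged))
```

On the raw values the three small numbers are squeezed near 0, while on the logged values they spread evenly, which is why the two techniques are often combined.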

4. Standardization: It is also called z-score scaling, and the z-score is also known as the standard score. The scaled values have mean 0 and standard deviation 1, like the standard normal distribution, although individual values are not confined to a fixed range such as -1 to 1. z is defined by the formula

z = (x - mean(x)) / SD(x)
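
A minimal sketch of z-score scaling with NumPy. The sample values are made up, and the population standard deviation (NumPy's default, ddof=0) is an assumption. Some tools use the sample standard deviation instead.

```python
import numpy as np

x = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)

# Subtract the mean first, then divide by the standard deviation
z = (x - x.mean()) / x.std()

print(z)
print("mean:", z.mean(), "SD:", z.std())
```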

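5. Handling Missing Values: A value may be missing because it was not stored, was intentionally not observed, was lost through human error, or was not recorded for some other reason. It may appear as a blank, NULL or NaN. There are 3 types of Missing Values (MV):

a. MCAR : Missing Completely At Random. The missingness is unrelated to any data, observed or unobserved, for example a random human error.

b. MNAR : Missing Not At Random. The missingness depends on the unobserved value itself.

c. MAR : Missing At Random. The missingness depends only on other observed variables.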
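
Before choosing a treatment, it helps to count how much is missing. A small pandas sketch follows. The data frame is made up, and treating an empty string as missing is an assumption that depends on the data set.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 40, None],
    "city": ["Chennai", None, "", "Mumbai"],
})

# pandas already treats None and NaN as missing; blanks must be converted first
df["city"] = df["city"].mask(df["city"] == "")

missing_count = df.isna().sum()    # number of missing values per column
missing_share = df.isna().mean()   # fraction of missing values per column

print(missing_count)
print(missing_share)
```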

Click for kaggle FE Tutorial
Click for towardsdatascience FE techniques

Friday, 29 April 2022

Distribution Plots

First, see the histogram

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

# Load the data
df = pd.read_csv('nft.csv')
# Extract feature we're interested in
data = df['release_year']

# Generate histogram/distribution plot
sns.displot(data)
plt.savefig('hist1.png')
plt.show()


data = df['release_year']
sns.displot(data, discrete = True, kde = True)
plt.savefig('dist_hist.png')
plt.show()

See the histogram with Distribution Curve

In this way we can combine a histogram and a distribution curve to understand the data better.

Code for a joint plot:


df.dropna(inplace=True)  # drop rows with null values before plotting
sns.jointplot(x = "rating", y = "release_year", data = df)
plt.savefig('joint.png')
plt.show()

Seaborn makes it very easy to plot these types of charts.

Tuesday, 26 April 2022

Lab Exercises for MBA DS #1

1. Draw Charts for TEU for MAERSK and SCI (Shipping Corporation of India)

Aim : Infer TEU for Maersk and SCI

Procedure:

import libraries

import matplotlib.pyplot as plt
import numpy as np

define dataset


year = [2016, 2017,2018,2019,2020, 2021]
Maersk = [12000, 14000, 17000, 21000, 9000, 12000 ]
sci = [200,800, 1500, 1700, 150, 600]
scin = [2,8, 15, 17, 15, 6]

Maersk_new = np.log10(Maersk) 

print(Maersk_new)     #[4.07918125 4.14612804 4.23044892 4.32221929 3.95424251 4.07918125]

sci_new = np.log10(sci)

print(sci_new)        #[2.30103    2.90308999 3.17609126 3.23044892 2.17609126 2.77815125]

Plot graphs with actual values,

def plot1():
    fig = plt.figure(figsize=(5, 4))
    plt.plot(year, Maersk,'b-', year, sci,'r-')
    plt.title('Multiple Plots for original values')
    plt.xlabel('Year')
    plt.ylabel('TEU')
    plt.grid()
    plt.savefig('renga1.png')

    plt.show()

plot1()

Plot with log10 values

def plot2():
    fig = plt.figure(figsize=(5, 4))
    plt.plot(year, Maersk_new,'b-', year, sci_new,'r-')
    plt.title('Multiple Plots for log10 Values')
    plt.xlabel('Year')
    plt.ylabel('log10(TEU)')
    plt.grid()
    plt.savefig('renga2.png')
    plt.show()

plot2()

Plot with log10 Maersk values and small SCI values (scin)

def plot3():
    fig = plt.figure(figsize=(5, 4))
    plt.plot(year, Maersk_new,'b-', year, scin,'r-')
    plt.title('Multiple Plots values for SCI-Small Values')
    plt.xlabel('Year')
    plt.ylabel('TEU')
    plt.grid()
    plt.savefig('renga3.png')
    plt.show()

plot3()

Multiple sub plots for comparison in a column

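def plot4():
    fig = plt.figure(figsize=(5, 8))
    # Figure-level title; plt.title() would only label a single subplot
    fig.suptitle('Multiple Subplots Column wise')

    # plot 1: original values
    plt.subplot(3, 1, 1)
    plt.plot(year, Maersk, year, sci)
    plt.ylabel('TEU')
    plt.title('Year Vs Maersk and SCI TEU')

    # plot 2: log10 values
    plt.subplot(3, 1, 2)
    plt.plot(year, Maersk_new, year, sci_new)
    plt.ylabel('log10(TEU)')
    plt.title('Year Vs Maersk_new and Sci_new')

    # plot 3: log10 Maersk values and small SCI values
    plt.subplot(3, 1, 3)
    plt.plot(year, Maersk_new, year, scin)
    plt.xlabel('Year')
    plt.title('Year Vs Maersk_new and Scin')

    plt.tight_layout()  # keep titles and axis labels from overlapping
    plt.savefig('renga4.png')
    plt.show()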

plot4()

From the above charts, can we infer which is better, Maersk or SCI?

Happy Learning with AMET ODL🚢🚢🚢🚢🛳🛳🛳🛳😆

DS#7 EDA Notes

Exploratory Data Analysis

EDA:

Exploratory Data Analysis (EDA) is an approach to analyze the data using visual techniques. It is used to discover trends, patterns, or to check assumptions with the help of statistical summary and graphical representations.

Get maximum insights from a data set.
Uncover underlying structure.
Extract important variables from the dataset.
Detect outliers and anomalies(if any)
Test underlying assumptions.
Determine the optimal factor settings

EDA Techniques

Univariate Non-graphical.
Multivariate Non-graphical.
Univariate graphical.
Multivariate graphical.
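
The two non-graphical techniques can be sketched with pandas as follows. The study-hours data set and the 1.5 x IQR outlier rule are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6],
    "score": [52, 55, 61, 64, 70, 95],
})

# Univariate non-graphical: summary statistics for one variable
summary = df["score"].describe()

# Outlier check with the common 1.5 x IQR rule
q1, q3 = df["score"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["score"] < q1 - 1.5 * iqr) | (df["score"] > q3 + 1.5 * iqr)]

# Multivariate non-graphical: pairwise correlation between variables
corr = df.corr()

print(summary)
print(outliers)
print(corr)
```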

Types of Data Analysis

Several data analysis techniques exist encompassing various domains such as business, science, social science, etc. with a variety of names.

The major data analysis approaches are −

Data Mining
Business Intelligence
Statistical Analysis
Predictive Analytics
Text Analytics

Data Mining

Data Mining is the analysis of large quantities of data to extract previously unknown, interesting patterns, unusual data and dependencies. Note that the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction of the data itself.

Data mining analysis involves computer science methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The patterns obtained from data mining can be considered a summary of the input data that can be used in further analysis or to obtain more accurate prediction results by a decision support system.

Business Intelligence

Business Intelligence techniques and tools are for acquisition and transformation of large amounts of unstructured business data to help identify, develop and create new strategic business opportunities.
The goal of business intelligence is to allow easy interpretation of large volumes of data to identify new opportunities. It helps in implementing an effective strategy based on insights that can provide businesses with a competitive market-advantage and long-term stability.

Statistical Analysis

Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data.

In data analysis, two main statistical methodologies are used −

Descriptive statistics − In descriptive statistics, data from the entire population or a sample is summarized with numerical descriptors such as −

Mean, Standard Deviation for Continuous Data
Frequency, Percentage for Categorical Data

Inferential statistics − It uses patterns in the sample data to draw inferences about the represented population, accounting for randomness. These inferences can be −

answering yes/no questions about the data (hypothesis testing)
estimating numerical characteristics of the data (estimation)
describing associations within the data (correlation)
modeling relationships within the data (e.g. regression analysis)
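
The descriptive summaries listed above can be computed with pandas as in this small sketch. The salary and department values are made up, and std() here is the sample standard deviation (pandas default, ddof=1).

```python
import pandas as pd

df = pd.DataFrame({
    "salary": [30, 40, 50, 60, 70],            # continuous data
    "dept": ["HR", "IT", "IT", "HR", "IT"],    # categorical data
})

# Continuous data: mean and standard deviation
mean_salary = df["salary"].mean()
sd_salary = df["salary"].std()

# Categorical data: frequency and percentage
frequency = df["dept"].value_counts()
percentage = df["dept"].value_counts(normalize=True) * 100

print(mean_salary, round(sd_salary, 3))
print(frequency)
print(percentage)
```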

Predictive Analytics

Predictive Analytics uses statistical models to analyze current and historical data for forecasting (predictions) about future or otherwise unknown events. In business, predictive analytics is used to identify risks and opportunities that aid in decision-making.

Text Analytics

Text Analytics, also referred to as Text Mining or Text Data Mining, is the process of deriving high-quality information from text. Text mining usually involves structuring the input text, deriving patterns within the structured data using means such as statistical pattern learning, and finally evaluating and interpreting the output.

Definition:

Data Analysis was defined by the statistician John Tukey in 1961 as "Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data."

Thus, data analysis is a process for obtaining large, unstructured data from various sources and converting it into information that is useful for −

Answering questions
Testing hypotheses
Decision-making
Disproving theories