Feature Engineering
- Delete or drop rows or columns, accepting the probability of data loss.
- The following data imputation methods are popular, but with categorical values the encoding may predict poorly; a short scikit-learn sketch follows the list.
- Replacing with the Mean, Median, or Mode of the column prevents data loss but adds some bias to the feature. This is most useful for continuous numerical values.
- Replace with LOCF, i.e., Last Observation Carried Forward (or Backward). This is more useful in time series.
- Predict the missing value by interpolation or extrapolation, using regression or classification algorithms guided by statistical covariance. The prediction serves as a proxy for the true value.
- Using Naïve Bayes, KNN, or Random Forest, the new values can be estimated. Scikit-learn supports KNN-based imputation via KNNImputer, though it has no built-in Naïve Bayes or Random Forest imputers.
- Deep learning algorithms can be used to find more accurate values, but this process may be very slow on massive data sets.
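As a minimal sketch of these options, here is a hedged example with pandas and scikit-learn; the DataFrame and the column names "age" and "price" are illustrative assumptions, not data from the course.

```python
# Illustrative imputation sketch; the data and column names are assumptions.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

df = pd.DataFrame({
    "age":   [25.0, np.nan, 47.0, 31.0, np.nan],
    "price": [10.0, 10.5, np.nan, 11.2, 11.0],
})

# Mean replacement: prevents data loss but adds bias to the feature.
mean_imp = SimpleImputer(strategy="mean")
df["age_mean"] = mean_imp.fit_transform(df[["age"]]).ravel()

# LOCF (forward fill) and backward fill: most useful for time series.
df["price_locf"] = df["price"].ffill()
df["price_bocf"] = df["price"].bfill()

# KNN imputation: estimates a missing value from the most similar rows.
knn_imp = KNNImputer(n_neighbors=2)
imputed = knn_imp.fit_transform(df[["age", "price"]])
df["age_knn"], df["price_knn"] = imputed[:, 0], imputed[:, 1]

print(df)
```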
Please remember the F2F class discussions.

Feature Engineering
A feature is a useful data column that enables better decisions and better insights.
Feature engineering combines feature generation and feature selection; it also covers standardization of the generated features.
It involves transforming raw data into a data set ready for model building in data science.
It makes raw data model-ready and creates features that make insights more useful. It is the most crucial and deciding factor in making or breaking the business decision process.
Let us first discuss the various ways of generating features from raw data.
1. Encoding
2. Binning
3. Normalization
4. Standardization
5. Missing Value handling
6. Data Imputation
1. Encoding: It is the process of converting categorical values to numerical values.
a. One-Hot Encoding
i. Each unique value gets its own binary column, so if there are many unique values, dimensionality increases. The column matching a row's value is encoded as '1' and the rest as '0' (see the sketch after item b).
b. Label Encoding
i. Instead of handling strings such as male, female, and third gender, we encode them as integers, so numerical handling is easier and more robust. But in doing so we implicitly impose an ordinality on the values. We must avoid introducing unnecessary false ordinality; only columns where an ordering makes sense (or does no harm) should be label encoded, as in the sketch below.
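As a minimal sketch of both encodings, here is a hedged example; the column name "gender" and its values are illustrative assumptions. Note that scikit-learn's LabelEncoder numbers categories from 0 rather than 1.

```python
# Illustrative encoding sketch; the data and column name are assumptions.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"gender": ["male", "female", "third", "female"]})

# One-hot encoding: one binary column per unique value,
# '1' in the column that matches the row's value, '0' elsewhere.
one_hot = pd.get_dummies(df["gender"], prefix="gender", dtype=int)
print(one_hot)

# Label encoding: each category becomes an integer
# (alphabetical here: female=0, male=1, third=2).
# Beware: this imposes an ordering that is false for nominal data.
le = LabelEncoder()
df["gender_label"] = le.fit_transform(df["gender"])
print(df)
```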
2. Binning: It is the process of converting continuous numerical values to categorical values, the reverse of encoding. It is also called bucketing, since we put a range of values into a single bucket or bin using histogram techniques. The model becomes more robust, and binning can help avoid overfitting.
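As a minimal sketch of binning, here is a hedged example with pandas; the ages, bin edges, and labels are illustrative assumptions.

```python
# Illustrative binning sketch; values, edges, and labels are assumptions.
import pandas as pd

ages = pd.Series([3, 17, 25, 46, 64, 81])

# Fixed-width bins: each range becomes one bucket (category).
fixed = pd.cut(ages, bins=[0, 18, 40, 65, 100],
               labels=["child", "young", "middle", "senior"])
print(fixed)

# Quantile bins: equal-count buckets driven by the data's distribution,
# which is closer to the histogram-based bucketing described above.
quart = pd.qcut(ages, q=4, labels=["Q1", "Q2", "Q3", "Q4"])
print(quart)
```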
3. Normalization: It is a scaling technique for features. If the data range is very wide, it is pretty hard to put everything on a single graph for visualization. Taking the log can solve this, or Min-Max normalization can solve the issue to a certain extent, depending on the nature of the data set. Here we scale everything to the range 0 to 1. Normally it is done by the formula
x1 = (x - min(x)) / (max(x) - min(x))
where x1 is the new value and x is the old value.
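As a minimal sketch applying this formula, here is a hedged numpy example; the sample values are illustrative assumptions. Scikit-learn's MinMaxScaler gives the same result.

```python
# Illustrative Min-Max normalization; sample values are assumptions.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([10.0, 20.0, 35.0, 50.0])

# x1 = (x - min(x)) / (max(x) - min(x)) scales every value into [0, 1].
x1 = (x - x.min()) / (x.max() - x.min())
print(x1)  # [0.    0.25  0.625 1.   ]

# The same scaling via scikit-learn (expects a 2-D column).
x1_sk = MinMaxScaler().fit_transform(x.reshape(-1, 1)).ravel()
print(x1_sk)
```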
4. Standardization: It is also called z-score scaling, after the z of the standard normal distribution. Here the result has mean 0 and standard deviation 1; unlike Min-Max scaling, the values are not confined to a fixed range such as -1 to 1, though most land near 0. z is defined by the formula
z = (x - mean(x)) / sd(x)
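As a minimal sketch applying this formula, here is a hedged numpy example; the sample values are illustrative assumptions. numpy's default std (ddof=0) matches scikit-learn's StandardScaler.

```python
# Illustrative z-score standardization; sample values are assumptions.
import numpy as np

x = np.array([10.0, 20.0, 35.0, 50.0])

# z = (x - mean(x)) / sd(x): the result has mean 0 and std 1.
z = (x - x.mean()) / x.std()
print(z)
print(z.mean(), z.std())  # ~0.0 and 1.0
```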
5. Handling Missing Values: Missing means not stored: intentionally not observed, lost to human error, not recorded, or not observed for some other reason. Missing entries may appear as blank, NULL, or NaN. There are 3 types of missing values (MV), listed below; a short pandas sketch for spotting them follows the list.
a. MCAR: Missing Completely At Random, e.g., due to human error.
b. MNAR: Missing Not At Random.
c. MAR: Missing At Random.
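As a minimal sketch of spotting these missing values, here is a hedged pandas example; the columns and data are illustrative assumptions. Note that deciding whether a given NaN is MCAR, MNAR, or MAR requires domain knowledge, not code.

```python
# Illustrative missing-value inspection; data and columns are assumptions.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25.0, np.nan, 47.0, None],
    "income": [50000, 62000, np.nan, 58000],
})

# Blank, NULL, and NaN entries all surface as NaN in pandas.
print(df.isna())         # boolean mask of missing cells
print(df.isna().sum())   # missing count per column
print(df.isna().mean())  # missing fraction per column
```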