Tuesday 31 May 2022

Lab Exercises for MBA DS #2

Let us see how visualization helps in Exploratory Data Analysis (an EDA case study)


Lab exercises for coursework

1.  Python Environment setup and Essentials.
2.  Mathematical computing with Python (NumPy).
3.  Scientific Computing with Python (SciPy).
4.  Data Manipulation with Pandas.
5.  Prediction using Scikit-Learn.
6.  Data Visualization in Python using Matplotlib.








Q&A


1. Difference between Data Analysis, Data Analytics and Data Science

Data Analytics

  • Superset process of data analysis.
  • Studies the past and present, and predicts the future.
  • Answers "what will happen in future?": data analytics predicts ‘what will happen next or what is going to be next?’
  • Tools: Tableau, Zoho Analytics, Talend, Hadoop, Xplenty, Kafka, Python, R, Cassandra, MongoDB, HPCC, Spark, Datawrapper, Power BI, etc.
  • The data analytics life cycle consists of Business Case Evaluation, Data Identification, Data Acquisition & Filtering, Data Extraction, Data Validation & Cleansing, Data Aggregation & Representation, Data Analysis, Data Visualization, and Utilization of Analysis Results.

Data Analysis

  • Subset process of data analytics.
  • Past discovery.
  • Answers "what happened?": data analysis is actually studying past data to understand ‘what happened?’
  • Tools: Python, R, Power BI, Tableau, Excel.
  • Involves querying, wrangling, statistical modelling, analysis and visualization.

Data Science

  • Set process.
  • All analysis and prediction.
  • Involves data sourcing, cleansing, modelling, result evaluation, result testing and deployment.

 

Data Analytics and Data Science

Data Science is a combination of multiple disciplines – Mathematics, Statistics, Computer Science, Information Science, Machine Learning, and Artificial Intelligence.


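2. Difference between mutable and immutable objects in Python

In Python, built-in types like int, float, bool, str and tuple are immutable: once an object is created, its value cannot be changed.

In Python, built-in types like list, dict and set are mutable: an object can be changed in place after it is created.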
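The difference is easiest to see with two names bound to the same object. The snippet below is a small illustrative sketch; the variable names are made up for the example.

```python
# Immutable objects: an "update" rebinds the name to a new object,
# so any other name still bound to the old object is unaffected.
text = "data"
text_alias = text
text += " science"
print(text_alias)        # the original str object is untouched

count = 5
count_alias = count
count += 1
print(count_alias)       # the original int is untouched as well

# Tuples reject item assignment outright.
point = (1, 2)
try:
    point[0] = 10
except TypeError as err:
    tuple_error = str(err)
    print("tuple:", tuple_error)

# Mutable objects: the change happens in place, so every alias sees it.
scores = [70, 80]
scores_alias = scores
scores_alias.append(90)
print(scores)

settings = {"theme": "dark"}
settings_alias = settings
settings_alias["theme"] = "light"
print(settings)
```

This is why passing a list or dict to a function lets the function modify the caller's data, while passing an int or str does not.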

 


Friday 27 May 2022

Distribution with Seaborn and Random

Distributions


Distributions can be visualized with the help of seaborn and numpy.random.

Look at the following sample code:



import matplotlib.pyplot as plt
import seaborn as sns
from numpy import random

# Note: sns.distplot is deprecated. Its replacements (seaborn 0.11+) are
# sns.kdeplot for a density curve and sns.histplot for a histogram.

# Normal distribution
sns.kdeplot(random.normal(size=1000))
plt.title('Normal Distribution')
plt.savefig('nd.jpg')
plt.show()

# Binomial distribution
sns.histplot(random.binomial(n=10, p=0.5, size=100))
plt.title('Binomial Distribution')
plt.savefig('bnd.jpg')
plt.show()


# Poisson distribution
sns.histplot(random.poisson(lam=2, size=100))
plt.title('Poisson Distribution')
plt.savefig('pois.jpg')
plt.show()


# Normal and binomial together
sns.kdeplot(random.normal(loc=50, scale=5, size=100), label='normal')
sns.kdeplot(random.binomial(n=100, p=0.5, size=100), label='binomial')
plt.legend()  # needed for the labels to appear on the plot
plt.title('Comparison between Normal and Binomial Distribution')
plt.savefig('n-bnd.jpg')
plt.show()

sns.histplot(random.poisson(lam=2, size=100))
plt.title('Poisson Distribution')
plt.savefig('poisson.jpg')
plt.show()


# Uniform distribution
sns.kdeplot(random.uniform(size=100))
plt.title('Uniform Distribution')
plt.savefig('uniform.jpg')
plt.show()

# Logistic distribution
sns.kdeplot(random.logistic(size=1000))
plt.title('Logistic Distribution')
plt.savefig('logistic.jpg')
plt.show()

# Multinomial distribution: counts from 6 rolls of a fair die
sns.histplot(random.multinomial(n=6, pvals=[1/6, 1/6, 1/6, 1/6, 1/6, 1/6]), kde=True, stat='density')
plt.title('Multinomial Distribution')
plt.savefig('multimodal.jpg')
plt.show()


# Exponential distribution
sns.kdeplot(random.exponential(size=1000))
plt.title('Exponential Distribution')
plt.savefig('exponential.jpg')
plt.show()


# Chi-square distribution
sns.kdeplot(random.chisquare(df=1, size=1000))
plt.title('Chi-square Distribution')
plt.savefig('chisquare.jpg')
plt.show()

# Rayleigh distribution
sns.kdeplot(random.rayleigh(size=1000))
plt.title('Rayleigh Distribution')
plt.savefig('Rayleigh.jpg')
plt.show()

# Pareto distribution
sns.histplot(random.pareto(a=2, size=1000))
plt.title('Pareto Distribution')
plt.savefig('Pareto.jpg')
plt.show()


# Zipf distribution (values below 10 only, to keep the long tail readable)
x = random.zipf(a=2, size=1000)
sns.histplot(x[x<10])
plt.title('Zipf Distribution')
plt.savefig('zipf.jpg')
plt.show()
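The comparison plot in the code above draws normal(loc=50, scale=5) against binomial(n=100, p=0.5). These two share the same mean and variance, which is why their curves look alike. The sketch below checks this numerically. It is only an illustration: the seed and sample size are arbitrary choices, and it uses NumPy's default_rng generator rather than the module-level functions used above.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed, for repeatable output
SIZE = 100_000                        # large sample so the estimates are stable

samples = {
    "normal": rng.normal(loc=50, scale=5, size=SIZE),
    "binomial": rng.binomial(n=100, p=0.5, size=SIZE),
    "poisson": rng.poisson(lam=2, size=SIZE),
}

# Theoretical (mean, variance) pairs:
#   normal:   mean = loc,  variance = scale**2
#   binomial: mean = n*p,  variance = n*p*(1-p)
#   Poisson:  mean = variance = lam
expected = {
    "normal": (50.0, 5.0 ** 2),
    "binomial": (100 * 0.5, 100 * 0.5 * 0.5),
    "poisson": (2.0, 2.0),
}

for name, sample in samples.items():
    mean, var = expected[name]
    print(f"{name:8s} sample mean={sample.mean():6.2f} (theory {mean:5.1f})  "
          f"sample var={sample.var():6.2f} (theory {var:5.1f})")
```

With a sample this large, the printed sample values should sit close to the theoretical ones.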

The resulting graphs show how easy it is to plot distributions with seaborn and numpy.random.








Tuesday 17 May 2022

DS#10 Visualization Basics

Data Visualization

Three basic principles (the 3 S's)

    Standard, Simple & Scalable.

The main goal of data visualization is to make it easier to identify patterns, trends and outliers in large data sets. The term is often used interchangeably with others, including information graphics, information visualization and statistical graphics.


1. Data visualizations should be used to empower a specific audience and address their needs - actionable and meaningful content.

2. Choose the right visual for your purpose - [Read]

3. Provide Context: Context engenders trust, which leads to action

4. Keep visualizations and dashboards simple and digestible

5. Design to keep users engaged 

Ref : 


Tables are used where users need to see the pattern of a specific parameter, while charts are used to show patterns or relationships in the data for one or more parameters.

AIDA Formula of Persuasive viewing:

  • Attention − Hook the reader with an attention-grabbing sentence.

  • Interest − Create interest by mentioning benefits of what the reader likes.

  • Desire − Use middle paragraphs to prompt the reader towards action.

  • Action − State the action the reader needs to take to get what they desire.



Ref :

Chart types Tutorials


Visualization Project Ideas


Data Visualization - A typical Reader 


Misleading Graphs


Simple Excel add-ins for Viz (please try at home, just for fun learning)


Data Visualizer in Excel (beyond syllabus; for use at home)

Visualization guru Hans Rosling - just to have a glance

Bill Gates Recommended this Book 

GapMinder  Slides : Videos:




Tuesday 10 May 2022

Python Basics Videos

Python: W.Y.S-W.Y.U Videos

(What You See is What You Understand; no voice 👪)

See and Do  Play List
  1. Baby program
  2. Keywords
  3. Statements
  4. Multiple Assignments and Comments
  5. Function and Doc String printing
  6. Input and Output
  7. Arithmetic Operators
  8. Relational, Logical & Bitwise Operators
  9. Identity & Membership Operators
  10. List basic operations
  11. List more Operations
  12. Set Operations
  13. Dictionary Operations
  14. Dictionary: by Comprehension, enumeration, Nested
  15. Lambda comprehension in Python
  16. Lambda with map and filter
  17. Scope of variables
  18. try except else finally 
  19. Easy ways to read online data
  20. If conditional
  21. While loop
  22. for & nested for loop
  23. break vs continue vs pass
  24. Date time representation, comparison and arithmetic
  25. File operations
  26. File delete
  27. Class objects, methods and inheritance
  28. Array handling made simple
  29. Pandas Nano Course

DS#09 Feature Selection

 Feature selection

What?

Feature selection, the choice of refined input variables, is a very important step: it keeps the variables that are most predictive of a given outcome.

"Variable selection is a problem of selecting the subset of features such that accuracy of the induced classifier is maximal."

Why?

  • To reduce the cost or risk associated with observing variables.
  • To increase predictive power
  • To reduce the size of models, so they are easier to trust and understand
  • To understand the domain


Sample Problem:

Let M be a metric that scores a model and a feature subset according to the predictions made and the features used.

Let A be the learning algorithm used to build the model.

FSS problem 1: Select a feature subset s that maximizes the score M gives to the model learned by A using the features in s.

FSS problem 2: Select a feature subset s and a learner A' that maximize the score M gives to the model learned by A' using the features in s.

Example: M is accuracy plus a preference for smaller models, and A is an SVM.

Find the minimal feature subset that maximizes the accuracy of an SVM.

Other possibilities for M: calibrated accuracy, AUC, or a trade-off between accuracy and the cost of features.
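The SVM example above can be sketched as a brute-force search over feature subsets. This is only an illustration under stated assumptions, not a recommended method: the iris dataset, the 5-fold cross-validation, and the PENALTY value (a hypothetical per-feature cost that encodes the "preference for smaller models") are all choices made for this example. Exhaustive search is only feasible here because iris has four features.

```python
from itertools import combinations

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # assumed example data: 150 rows, 4 features
n_features = X.shape[1]
PENALTY = 0.01                      # hypothetical cost charged per feature used


def score(subset):
    """Metric M: mean cross-validated accuracy minus a small cost per feature."""
    accuracy = cross_val_score(SVC(), X[:, subset], y, cv=5).mean()
    return accuracy - PENALTY * len(subset)


# Enumerate every non-empty feature subset (2**4 - 1 = 15 candidates).
subsets = [list(combo)
           for k in range(1, n_features + 1)
           for combo in combinations(range(n_features), k)]

# Learner A is fixed (an SVM); pick the subset s that maximizes M.
best = max(subsets, key=score)
print("best feature indices:", best)
print("score M:", round(score(best), 3))
```

For realistic feature counts the number of subsets grows as 2^n, which is why the filter, wrapper, intrinsic and hybrid methods exist as cheaper alternatives.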

Ref :   Find the importance of feature: 

Methods

  1. Filters
  2. Wrappers
  3. Intrinsic
  4. Hybrid

Selection of features based on the types of input and output (target) data.

As we know, data can be numerical (integer or float) or categorical (nominal, ordinal or dichotomous).

The input types, output types and matching methods are listed below:
  1. Numerical input, numerical output: Pearson's correlation coefficient (linear), Spearman's rank correlation (non-linear).
  2. Numerical input, categorical output: ANOVA correlation coefficient (linear), Kendall's rank coefficient (non-linear).
  3. Categorical input, categorical output: Chi-squared test (contingency tables), mutual information.
  4. Categorical input, numerical output: the methods from case 2, applied in reverse.
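To illustrate the linear versus non-linear distinction in the first case of the list above, the sketch below compares Pearson's and Spearman's coefficients on a relationship that is perfectly monotonic but not linear. The data is synthetic and chosen only for the illustration; it assumes SciPy is installed.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic numerical input and output: y always rises with x, but not linearly.
x = np.linspace(0, 5, 50)
y = np.exp(x)

r_pearson = pearsonr(x, y)[0]     # measures linear association
r_spearman = spearmanr(x, y)[0]   # measures monotonic (rank) association

print(f"Pearson : {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

Spearman's coefficient works on ranks, so it rates this relationship as perfect, while Pearson's coefficient is pulled down by the curvature.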

To select the top variables, we can use the scikit-learn library's SelectKBest() and SelectPercentile() classes.
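A minimal sketch of the two scikit-learn selectors named above. The iris dataset, the ANOVA F-test scorer (f_classif, suited to numerical input with a categorical target), and the values k=2 and percentile=50 are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

X, y = load_iris(return_X_y=True)   # numerical inputs, categorical target

# Keep the 2 features with the highest ANOVA F-scores.
top_k = SelectKBest(score_func=f_classif, k=2).fit(X, y)

# Keep the best-scoring 50% of the features.
top_pct = SelectPercentile(score_func=f_classif, percentile=50).fit(X, y)

print("F-scores per feature:    ", top_k.scores_.round(1))
print("SelectKBest keeps:       ", top_k.get_support(indices=True))
print("SelectPercentile keeps:  ", top_pct.get_support(indices=True))

X_reduced = top_k.transform(X)      # data restricted to the selected columns
print("reduced shape:", X_reduced.shape)
```

Swapping score_func (for example to chi2 or mutual_info_classif from the same module) changes the statistical test without changing the rest of the code.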

Ref : https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/

Customer churn refers to the loss of existing customers, which causes heavy losses to any business.

References :  

  Feature selection : General, Customer Churn Case Study

  Churn Prediction : Churn Analysis

  Churn Prediction : Commercial use of Data Science

Formula Ref for Precision & Recall
