Friday, 1 April 2022

P#18 EDA

Exploratory Data Analysis

For a quick overview, we can use the following methods and attributes of a DataFrame (here named df):

df.head() # show first 5 rows
df.tail() # last 5 rows
df.columns # list all column names
df.shape # get number of rows and columns
df.info() # additional info about dataframe
df.describe() # statistical description, only for numeric values
df['col_name'].value_counts(dropna=False) # count occurrences of each unique value in a column, including NaN
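
As a minimal sketch of what these calls return, the snippet below runs a few of them on a tiny hand-made DataFrame. The column names and values are made up for illustration and are not taken from the iris file. It also shows why dropna=False matters: the missing value gets its own row in the counts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "length": [5.1, 4.9, 6.3, 5.8],
    "species": ["setosa", "setosa", np.nan, "virginica"],
})

print(toy.shape)           # (number of rows, number of columns)
print(list(toy.columns))   # column names
print(toy.head(2))         # first 2 rows

# Count each distinct value; dropna=False keeps NaN as its own category
counts_with_nan = toy["species"].value_counts(dropna=False)
# Default behaviour: rows with NaN are silently left out of the counts
counts_without_nan = toy["species"].value_counts()
print(counts_with_nan)
```

Comparing the totals of the two counts against the number of rows is a quick way to notice missing values in a column.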



import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('iris_csv.csv')
# print(df)
"""
sepallength sepalwidth petallength petalwidth class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica

[150 rows x 5 columns]
"""
# print(df.head()) # show first 5 rows
"""
sepallength sepalwidth petallength petalwidth class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
"""
# print(df.tail()) # last 5 rows
"""
sepallength sepalwidth petallength petalwidth class
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica
"""
#print(df.columns) # list all column names
#Index(['sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'class'], dtype='object')

# print(df.shape) # get number of rows and columns # (150, 5)

# print(df.info()) # additional info about dataframe
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepallength 150 non-null float64
1 sepalwidth 150 non-null float64
2 petallength 150 non-null float64
3 petalwidth 150 non-null float64
4 class 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
None
"""
# print(df.describe()) # statistical description, only for numeric values
"""
sepallength sepalwidth petallength petalwidth
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
"""
# print(df['class'].value_counts(dropna=False)) # count occurrences of each unique value in the 'class' column

# Another way to quickly check the data is to visualize it.
# Bar plots suit counts of discrete values,
# and histograms suit continuous values.

y = df['sepalwidth']
plt.hist(y)
plt.savefig('eda1.png')
plt.show()
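
The histogram above covers the continuous case. For the discrete case, a bar plot of value counts could look like the sketch below. It uses a small made-up Series instead of the iris file so that it runs on its own, and it selects the non-interactive "Agg" backend as an assumption so that no display is needed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical discrete column, for illustration only
labels = pd.Series(["a", "b", "a", "c", "a", "b"], name="label")

# value_counts() gives one count per distinct value, largest first
label_counts = labels.value_counts()

fig, ax = plt.subplots()
label_counts.plot(kind="bar", ax=ax)  # one bar per distinct value
ax.set_xlabel("label")
ax.set_ylabel("count")
```

With the iris data, the same pattern would apply to the counts of the 'class' column, followed by plt.savefig(...) and plt.show() as in the rest of the script.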


# Box plot of sepal length, grouped by class
df.boxplot(column='sepallength', by='class')
plt.savefig('eda2.png')
plt.show()


# Scatter plot to depict the relationship between two variables and show outliers, if any

# Use two numeric columns here; 'class' is a text column
df.plot(kind='scatter', x='sepallength', y='sepalwidth')
plt.savefig('edascatter.png')
plt.show()



Histograms and box plots help to spot outliers visually. The scatter plot shows the relationship between two numeric variables.
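
A box plot draws its whiskers from the interquartile range (IQR), so the same rule can flag outliers numerically. The sketch below assumes the common 1.5 x IQR convention and uses made-up values. Note that iqr_outliers is a hypothetical helper written for this example, not a pandas function.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical measurements with one obviously extreme value
widths = pd.Series([3.0, 3.1, 2.9, 3.2, 3.0, 2.8, 3.1, 9.0])
flagged = iqr_outliers(widths)
print(flagged)  # only the 9.0 entry falls outside the fences
```

On the iris data, the helper could be called on a single column such as df['sepalwidth'].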

To check the correlation between variables, we can draw a heatmap of the correlation matrix:

import seaborn as sns
plt.figure(figsize=(8,4))
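# Correlation is defined for numeric columns only, so the text column 'class' is left out
sns.heatmap(df.select_dtypes(include='number').corr(), cmap='Reds', annot=False)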
plt.savefig('heatmap.png')
plt.show()

In the heatmap above, the 'Reds' colormap draws higher correlation values in darker shades and lower (including negative) values in lighter shades. Set annot=True to print the correlation coefficient of each pair of features in its grid cell.
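
The values that annot=True prints are Pearson correlation coefficients (the default of DataFrame.corr), which range from -1 to +1. As a small sketch with made-up columns, a column that rises with x gives +1, one that falls as x rises gives -1, and a text column has to be dropped before the matrix is computed.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "up": [3.0, 5.0, 7.0, 9.0],     # rises with x (2*x + 1)
    "down": [8.0, 6.0, 4.0, 2.0],   # falls as x rises (10 - 2*x)
    "label": ["a", "b", "a", "b"],  # text column, not usable in corr()
})

# Keep only the numeric columns before computing the correlation matrix
corr = toy.select_dtypes(include="number").corr()
print(corr.round(2))
```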

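k = 12  # maximum number of variables to show; iris has only 4 numeric columns
corr = df.select_dtypes(include='number').corr()  # numeric columns only
cols = corr.nlargest(k, 'sepallength')['sepallength'].index  # ordered by correlation with sepallength
cm = df[cols].corr()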
plt.figure(figsize=(8,6))
sns.heatmap(cm, annot=True, cmap = 'viridis')
plt.savefig('sepallength.png')
plt.show()


From this figure we infer a strong positive correlation between petalwidth and petallength, since that pair has the largest positive coefficient.

From the heatmap we can read off how strongly or weakly each variable is correlated with every other variable.
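
Reading the strongest pairs off a heatmap by eye works for four columns; with more columns it may be easier to rank the pairs in code. The following is a minimal sketch of one way to do that, using made-up data and hypothetical column names.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data, only for illustration
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [1.1, 1.9, 3.2, 3.9, 5.1],  # tracks "a" closely
    "c": [2.0, 1.0, 2.0, 1.0, 2.0],  # oscillates, nearly uncorrelated with the others
})

corr = toy.corr()

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()  # Series indexed by (column 1, column 2)

# Sort by absolute strength, strongest first
order = pairs.abs().sort_values(ascending=False).index
ranked = pairs.reindex(order)
print(ranked)
```

Sorting by absolute value keeps strong negative correlations near the top as well, which a sort on the raw values would push to the bottom.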

Happy learning at AMET -ODL!!!

