Monday, 4 April 2022

P#21 3D Plots

3D plots are easy to draw with Matplotlib's mplot3d toolkit. Let us look at some examples to understand 3D plots.

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
plt.plot([2,4,6],[4,8,12],color='Red')
plt.xlabel('xlabel')
plt.ylabel('ylabel')
plt.savefig('3dline41.png')
# plt.show()
The plot above is a 3D line. The projection is set with ax = fig.add_subplot(111, projection='3d'). Because only x and y values are passed, the line lies in the z = 0 plane.
import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
import matplotlib.pyplot as plt

mpl.rcParams['legend.fontsize'] = 10
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')  # fig.gca(projection='3d') no longer works from Matplotlib 3.6
theta = np.linspace(-4 * np.pi, 4 * np.pi, 100)
z = np.linspace(-2, 2, 100)
r = z**2 + 1
x = r * np.sin(theta)
y = r * np.cos(theta)
ax.plot(x, y, z, label='parametric curve')
ax.legend()
plt.savefig('3dline42.png')
plt.show()

Now let us draw a 3D scatter plot.

from mpl_toolkits.mplot3d import axes3d
import matplotlib.pyplot as plt
import numpy as np

def randrange(n, vmin, vmax):
    # Return n random numbers drawn uniformly from [vmin, vmax).
    return (vmax - vmin)*np.random.rand(n) + vmin

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

n = 100

# For each set of style and range settings, plot n random points in the box
# defined by x in [23, 32], y in [0, 100], z in [zlow, zhigh].
for c, m, zlow, zhigh in [('r', 'o', -50, -25), ('b', '^', -30, -5)]:
    xs = randrange(n, 23, 32)
    ys = randrange(n, 0, 100)
    zs = randrange(n, zlow, zhigh)
    ax.scatter(xs, ys, zs, c=c, marker=m)

ax.set_xlabel('X Label')
ax.set_ylabel('Y Label')
ax.set_zlabel('Z Label')
# plt.savefig('3dscatter43.png')
plt.show()
Please note that the axis labels are set with the corresponding set_xlabel(), set_ylabel() and set_zlabel() methods.
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Grab some test data.
X, Y, Z = axes3d.get_test_data(0.05)

# Plot a basic wireframe.
ax.plot_wireframe(X, Y, Z, rstride=10, cstride=10)
plt.savefig('3dwireframe44.png')
plt.show()


Please note the wireframe model and the code that draws it. The following code draws a colored surface instead.
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
import numpy as np


fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')  # fig.gca(projection='3d') no longer works from Matplotlib 3.6

# Make data.
X = np.arange(-5, 5, 0.25)
Y = np.arange(-5, 5, 0.25)
X, Y = np.meshgrid(X, Y)
R = np.sqrt(X**2 + Y**2)
Z = np.sin(R)

# Plot the surface.
surf = ax.plot_surface(X, Y, Z, cmap=cm.coolwarm,
                       linewidth=0, antialiased=False)

# Customize the z axis.
ax.set_zlim(-1.01, 1.01)
ax.zaxis.set_major_locator(LinearLocator(10))
ax.zaxis.set_major_formatter(FormatStrFormatter('%.02f'))

# Add a color bar which maps values to colors.
fig.colorbar(surf, shrink=0.5, aspect=5)
plt.savefig('3dsurface45.png')
plt.show()

                           


from mpl_toolkits.mplot3d import Axes3D
from matplotlib.collections import PolyCollection
import matplotlib.pyplot as plt
from matplotlib import colors as mcolors
import numpy as np


fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')  # fig.gca(projection='3d') no longer works from Matplotlib 3.6


def cc(arg):
    # Convert a color name to RGBA with 60% opacity.
    return mcolors.to_rgba(arg, alpha=0.6)

xs = np.arange(0, 10, 0.4)
verts = []
zs = [0.0, 1.0, 2.0, 3.0]
for z in zs:
    ys = np.random.rand(len(xs))
    ys[0], ys[-1] = 0, 0
    verts.append(list(zip(xs, ys)))

poly = PolyCollection(verts, facecolors=[cc('r'), cc('g'), cc('b'),
                                         cc('y')])
poly.set_alpha(0.7)
ax.add_collection3d(poly, zs=zs, zdir='y')

ax.set_xlabel('X')
ax.set_xlim3d(0, 10)
ax.set_ylabel('Y')
ax.set_ylim3d(-1, 4)
ax.set_zlabel('Z')
ax.set_zlim3d(0, 1)
plt.savefig('3dploy46.png')
plt.show()


See the beauty of color. Notice how the colors and transparency make the overlapping polygons easier to read. Happy visualizing in 3D with Python, Matplotlib and AMET.
 

P#20 Slicing

SLICING

Slicing is the extraction of a part of a string, list, or tuple. It enables users to access the specific range of elements by mentioning their indices. 

We should first understand indexing. Slicing is much easier after that.

 An index is a position of an individual character or element in a list, tuple, or string. The index value always starts at zero and ends at one less than the number of items.
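The indexing rule above can be checked with a few lines. This is only a small sketch; the string "AMET" and the variable names are chosen for illustration.

```python
# Minimal sketch of zero-based indexing; `word` is an illustrative value.
word = "AMET"

first_index = 0
last_index = len(word) - 1      # one less than the number of items

# enumerate() pairs every character with its index, starting at 0.
for i, ch in enumerate(word):
    print(i, ch)

# Negative indices count from the end, so -1 is the last item.
same_last = word[last_index] == word[-1]
print(word[first_index], word[last_index], same_last)
```

An index equal to len(word) is already out of range, which is why the stop value of a slice is exclusive.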

#slice list

aList = [10,20,30, 40, 50]
slc = slice(1,4) # slice(start, stop) is a built-in; the stop index is exclusive
print(aList[slc]) #[20, 30, 40]

# slice tuple

aTuple = (10,20,30, 40, 50)
slc = slice(1,4) # the same kind of slice object works on a tuple
print(aTuple[slc]) # (20, 30, 40)

# stepping: slice(start, stop, step)
aList = [10,20,30, 40, 50,60, 70, 80, 90]
stepList = slice(1,8,2)
print(aList[stepList]) # [20, 40, 60, 80]

#Insertion @ Start
aList = [10,20,30, 40, 50,60, 70, 80, 90]
iList = ['a','b','c','d']
aList[:0] = iList
print(aList) # ['a', 'b', 'c', 'd', 10, 20, 30, 40, 50, 60, 70, 80, 90]

#Insertion @ End
aList = [10,20,30, 40, 50,60, 70, 80, 90]
iList = ['a','b','c','d']
aList[len(aList):] = iList
print(aList) # [10, 20, 30, 40, 50, 60, 70, 80, 90, 'a', 'b', 'c', 'd']

# Slicing a string
alist = "AMET ODL LEARNING"
print(alist[::-1]) # reversed string: GNINRAEL LDO TEMA

print(alist[4:8]) # indices 4 to 7: ' ODL' (with the leading space)
print(alist[0:4]) # indices 0 to 3 (4 is exclusive): AMET
print(alist[9::]) # index 9 to the last character: LEARNING
print(alist[:8]) # beginning up to index 7: AMET ODL
print(alist[-1]) # last character: G
The examples above show how indices are used in slicing. Interestingly, negative indices work too and count from the end.
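Negative numbers can also be used as the start or stop of a slice, and the bracket notation is equivalent to passing a slice object. A short sketch on the same sample string (the variable names are only for illustration):

```python
text = "AMET ODL LEARNING"

# Bracket notation and an explicit slice object select the same range.
same_range = text[5:8] == text[slice(5, 8)]

last_word = text[-8:]       # the last 8 characters
without_last = text[:-9]    # everything except the last 9 characters

print(same_range)     # True
print(last_word)      # LEARNING
print(without_last)   # AMET ODL
```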
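Deleting an element

# aList is the list built in the "Insertion @ End" example above
aList.remove(10) # remove() deletes the first item equal to 10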

print(aList) # [20, 30, 40, 50, 60, 70, 80, 90, 'a', 'b', 'c', 'd']
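remove() deletes by value. Since this post is about slicing, it may help to see that a range of positions can also be deleted or replaced with a slice. A minimal sketch with an illustrative list:

```python
a_list = [10, 20, 30, 40, 50, 60, 70, 80, 90]

del a_list[1:3]             # delete the items at positions 1 and 2
print(a_list)               # [10, 40, 50, 60, 70, 80, 90]

a_list[-2:] = []            # assigning an empty list deletes a range too
print(a_list)               # [10, 40, 50, 60, 70]

a_list[1:2] = ['x', 'y']    # replace one item with two items
print(a_list)               # [10, 'x', 'y', 50, 60, 70]
```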

Happy Learning at AMET ODL!👌👪



Sunday, 3 April 2022

P#19 Duplicates Handling

DUPLICATE REMOVAL

In any data set, duplicates are a perennial problem in data cleaning. This article briefly shows several ways to handle them.

Method 1: (Traditional loop way)

# Create a list with duplicates

dlist = [10,20,30,40,50,60,10,20,30]
print(dlist)
# remove duplicates
dupFreeList = []
for element in dlist:
    print(element)
    if element not in dupFreeList:
        dupFreeList.append(element)

print(dupFreeList) # [10, 20, 30, 40, 50, 60]

Method 2: (List comprehension way)


res = []
[res.append(x) for x in dlist if x not in res]

# printing list after removal
print ("The list after removing duplicates : " + str(res))
# The list after removing duplicates : [10, 20, 30, 40, 50, 60]

Method 3:

You can convert to set and then convert to list to remove duplicates.



dlistset = set(dlist)
print(dlistset)
# {40, 10, 50, 20, 60, 30}
dupFreeList = list(dlistset)
print(dupFreeList) # [40, 10, 50, 20, 60, 30] # Order is not Maintained
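If the set approach is used and the order matters, the result can be sorted afterwards. The sketch below uses a hypothetical list (not the dlist above) so that the two orderings differ:

```python
# Hypothetical sample in which value order and first-seen order differ.
sample = [30, 10, 30, 20, 10]

unique = set(sample)

# Option 1: sort by value (the items must be comparable).
by_value = sorted(unique)

# Option 2: sort by first position in the original list.
# list.index is a linear search, so this suits small lists only.
by_first_seen = sorted(unique, key=sample.index)

print(by_value)        # [10, 20, 30]
print(by_first_seen)   # [30, 10, 20]
```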


Method 4:


from collections import OrderedDict

dupFreeList = list(OrderedDict.fromkeys(dlist))

print(dupFreeList) # [10, 20, 30, 40, 50, 60] # order is maintained

Here, we have imported the OrderedDict class from the collections module and used list(OrderedDict.fromkeys(dlist)).

Method 5: Using list(dict.fromkeys(dlist)). Since Python 3.7 a plain dict keeps insertion order, so OrderedDict is not required.


dlist = ["10","20", "30","40","20","30"] # String
dflist = list(dict.fromkeys(dlist))
print(dlist, dflist)
#['10', '20', '30', '40', '20', '30'] ## ['10', '20', '30', '40']


dlist = [10,20,30,40,50,10,20] # integer
dflist = list(dict.fromkeys(dlist))
print(dlist, dflist) #[10, 20, 30, 40, 50, 10, 20] [10, 20, 30, 40, 50]
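The methods above need hashable items and treat two items as duplicates only when they are equal. When that is not enough (for example case-insensitive strings, or lists inside a list), one possible approach is a small helper with a key function. The name unique_by and its key parameter are hypothetical and not part of any library:

```python
def unique_by(items, key=None):
    """Return items without duplicates, keeping first occurrences in order."""
    seen = set()
    result = []
    for item in items:
        # The marker decides what counts as "the same" item.
        marker = item if key is None else key(item)
        if marker not in seen:
            seen.add(marker)
            result.append(item)
    return result

print(unique_by([10, 20, 30, 10, 20]))                      # [10, 20, 30]
print(unique_by(["AMET", "amet", "ODL"], key=str.lower))    # ['AMET', 'ODL']
print(unique_by([[1, 2], [1, 2], [3]], key=tuple))          # [[1, 2], [3]]
```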

Happy Open Learning at AMET ODL!

Friday, 1 April 2022

P#18 EDA

Exploratory Data Analysis

For a quick overview, we can use the following methods and attributes of a DataFrame df:

df.head() # show first 5 rows
df.tail() # last 5 rows
df.columns # list all column names
df.shape # get number of rows and columns
df.info() # additional info about dataframe
df.describe() # statistical description, only for numeric values
df['col_name'].value_counts(dropna=False) # count unique values in a column



import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('iris_csv.csv')
# print(df)
"""
sepallength sepalwidth petallength petalwidth class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica

[150 rows x 5 columns]
"""
# print(df.head()) # show first 5 rows
"""
sepallength sepalwidth petallength petalwidth class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
"""
# print(df.tail()) # last 5 rows
"""
sepallength sepalwidth petallength petalwidth class
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica
"""
#print(df.columns) # list all column names
#Index(['sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'class'], dtype='object')

# print(df.shape) # get number of rows and columns # (150, 5)

# print(df.info()) # additional info about dataframe
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepallength 150 non-null float64
1 sepalwidth 150 non-null float64
2 petallength 150 non-null float64
3 petalwidth 150 non-null float64
4 class 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
None
"""
# print(df.describe()) # statistical description, only for numeric values
"""
sepallength sepalwidth petallength petalwidth
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
"""
# df['col_name'].value_counts(dropna=False) #

#Another way to quickly check the data is by visualizing it.
#We use bar plots for discrete data counts
#and histogram for continuous.

y = df['sepalwidth']
plt.hist(y)
plt.savefig('eda1.png')
plt.show()


#
df.boxplot(column='sepallength', by='class')
plt.savefig('eda2.png')
plt.show()


## Scatter plot to depict relationship between two variables and show outliers if any

df.plot(kind='scatter', x='sepallength', y='class')
plt.savefig('edascatter.png')
plt.show()



Histograms and box plots help to spot outliers visually. A scatter plot shows the relationship between two variables.
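What the box plot shows visually can also be checked numerically. The sketch below applies the common 1.5 x IQR rule, which box plots conventionally use for their whiskers. The function name iqr_outliers and the sample values are hypothetical and are not taken from the iris data:

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    # method="inclusive" interpolates linearly between data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sample with one unusually large value.
sample = [2.9, 3.0, 3.1, 3.2, 3.0, 2.8, 3.1, 4.4]
print(iqr_outliers(sample))   # [4.4]
```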

To check the correlation between variables:

import seaborn as sns
plt.figure(figsize=(8,4))
sns.heatmap(df.corr(numeric_only=True),cmap='Reds',annot=False) # numeric_only skips the text column 'class' (pandas 1.5+)
plt.savefig('heatmap.png')
plt.show()

Above, with the 'Reds' color map, higher (more positive) correlation values are drawn in darker shades and lower or negative values in lighter shades. Change the argument to annot=True and each grid cell will also show the correlation value.
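Each cell of the heatmap holds a correlation coefficient; by default df.corr() computes the Pearson coefficient. As a rough sketch of what that number means, it can be computed by hand. The function name pearson and the sample lists are illustrative only:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of the products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical values: one pair rises together, the other moves oppositely.
print(round(pearson([1, 2, 3, 4], [2, 4, 6, 8]), 6))   # 1.0
print(round(pearson([1, 2, 3, 4], [8, 6, 4, 2]), 6))   # -1.0
```

Values close to +1 or -1 indicate a strong linear relationship, and values near 0 a weak one.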

k = 12
cols = df.corr(numeric_only=True).nlargest(k, 'sepallength')['sepallength'].index # up to k columns most correlated with sepallength
cm = df[cols].corr()
plt.figure(figsize=(8,6))
sns.heatmap(cm, annot=True, cmap = 'viridis')
plt.savefig('sepallength.png')
plt.show()


From this figure we infer a strong correlation between petalwidth and petallength, since that pair has the largest positive value.

From the heatmap we can check the strong and weak correlations of every variable with the others, as shown above.

Happy learning at AMET -ODL!!!


Pandas #05 Pivot Tables

PIVOT TABLE

A pivot table is an operation that is commonly seen in spreadsheets and other programs that operate on tabular data.

The pivot table takes simple column-wise data as input, and groups the entries into a two-dimensional table that provides a multidimensional summarization of the data. 

The difference between pivot tables and GroupBy can sometimes cause confusion; it helps me to think of pivot tables as essentially a multidimensional version of GroupBy aggregation. 

That is, you split-apply-combine, but both the split and the combine happen across not a one-dimensional index, but across a two-dimensional grid.
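The idea of splitting on two keys at once can be sketched without pandas. The records below are hypothetical toy values, not rows from a real dataset, and the nested dictionary stands in for the two-dimensional grid:

```python
from collections import defaultdict

# Hypothetical toy records: (year, gender, births).
records = [
    (1969, "F", 10), (1969, "M", 12),
    (1971, "F", 20), (1975, "M", 21), (1978, "M", 5),
]

# Split on two keys (decade and gender) and apply a sum to each cell.
grid = defaultdict(lambda: defaultdict(int))
for year, gender, births in records:
    decade = 10 * (year // 10)
    grid[decade][gender] += births

# Combine: one row per decade, one column per gender.
print("decade F M")
for decade in sorted(grid):
    print(decade, grid[decade]["F"], grid[decade]["M"])
```

A one-dimensional GroupBy would use a single key such as the decade; the pivot table adds the second key as the column axis.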

We will take a sample dataset, births.csv, as below:

import pandas as pd

births = pd.read_csv('https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv')

print(births)
"""
year month day gender births
0 1969 1 1.0 F 4046
1 1969 1 1.0 M 4440
2 1969 1 2.0 F 4454
3 1969 1 2.0 M 4548
4 1969 1 3.0 F 4548
... ... ... ... ... ...
15542 2008 10 NaN M 183219
15543 2008 11 NaN F 158939
15544 2008 11 NaN M 165468
15545 2008 12 NaN F 173215
15546 2008 12 NaN M 181235

[15547 rows x 5 columns]

"""

How do we count male and female births per decade? We create a column, say decade, then group by decade and gender and sum the births.


births['decade'] = 10 * (births['year'] // 10) # floor the year to its decade, e.g. 1969 -> 1960
births.pivot_table('births', index='decade', columns='gender', aggfunc='sum')
print(births.pivot_table('births', index='decade', columns='gender', aggfunc='sum'))
"""
gender F M
decade
1960 1753634 1846572
1970 16263075 17121550
1980 18310351 19243452
1990 19479454 20420553
2000 18229309 19106428

"""

Take a deep breath and look at the output. It is easy to see that male births outnumbered female births in every decade. The same table can also be plotted to explore it visually.

Cross Tab ( Contingency Table)

A contingency table is a tabular representation of categorical data. It usually shows the frequencies for particular combinations of values of two discrete random variables X and Y. Each cell in the table represents a mutually exclusive combination of X-Y values.

It is used to summarize a large data set.
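To make the definition concrete, a contingency table is just a count of how often each pair of values occurs together. A minimal sketch with the standard library, using hypothetical (region, diet) observations:

```python
from collections import Counter

# Hypothetical (region, diet) observations.
observations = [
    ("East", "vegetarian"), ("East", "non vegetarian"),
    ("West", "vegetarian"), ("West", "vegetarian"),
    ("North", "vegetarian"),
]

# Each distinct pair is one cell of the table.
counts = Counter(observations)

regions = sorted({region for region, _ in observations})
diets = sorted({diet for _, diet in observations})

print("region", diets)
for region in regions:
    # A Counter returns 0 for combinations that never occur.
    print(region, [counts[(region, diet)] for diet in diets])
```

Every observation lands in exactly one cell, so the cell counts add up to the number of observations.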

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('indian_food.csv') # From Kaggle
# print(df)
"""
name ... region
0 Balu shahi ... East
1 Boondi ... West
2 Gajar ka halwa ... North
3 Ghevar ... West
4 Gulab jamun ... East
.. ... ... ...
250 Til Pitha ... North East
251 Bebinca ... West
252 Shufta ... North
253 Mawa Bati ... Central
254 Pinaca ... West

[255 rows x 9 columns]

"""
#print(df.describe())
#print(df.shape)
#print(df.columns)


# Cross tab
# Compute a simple cross-tabulation of two (or more) factors.
# By default computes a frequency table of the factors unless an array of values and
# an aggregation function are passed.
# implementing crosstab on state & diet columns

print(pd.crosstab(df['state'], df['diet']))
"""
diet non vegetarian vegetarian
state
-1 0 24
Andhra Pradesh 0 10
Assam 10 11
Bihar 0 3
Chhattisgarh 0 1
Goa 1 2
Gujarat 0 35
Haryana 0 1
Jammu & Kashmir 0 2
Karnataka 0 6
Kerala 1 7
Madhya Pradesh 0 2
Maharashtra 2 28
Manipur 1 1
NCT of Delhi 1 0
Nagaland 1 0
Odisha 0 7
Punjab 4 28
Rajasthan 0 6
Tamil Nadu 1 19
Telangana 1 4
Tripura 1 0
Uttar Pradesh 0 9
Uttarakhand 0 1
West Bengal 5 19
"""



print(pd.crosstab(df['region'], df['diet']))
"""
diet non vegetarian vegetarian
region
-1 0 13
Central 0 3
East 5 26
North 5 44
North East 13 12
South 3 56
West 3 71
"""

print(pd.crosstab(df['region'], df['diet'], normalize='all')) # Note Normalization

"""
diet non vegetarian vegetarian
region
-1 0.000000 0.051181
Central 0.000000 0.011811
East 0.019685 0.102362
North 0.019685 0.173228
North East 0.051181 0.047244
South 0.011811 0.220472
West 0.011811 0.279528
"""

print(pd.crosstab(df['region'], df['diet'], normalize='index')) # index normalization

# Plotting
pd.crosstab(df['region'], df['diet']).plot(kind='line')

plt.savefig('regionline.png')
plt.show()

pd.crosstab(df['region'], df['diet']).plot(kind='bar')
plt.savefig('regionbar.png')
plt.show()


pd.crosstab(df['region'], df['diet']).plot(kind='barh')
plt.savefig('regionbarh.png')
plt.show()


print(pd.crosstab(df['flavor_profile'], df['diet']))
"""
diet non vegetarian vegetarian
flavor_profile
-1 3 26
bitter 0 4
sour 0 1
spicy 26 107
sweet 0 88
"""

crosstab has many uses. Here we discussed its most popular ones: frequency tables, normalization and plotting.
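The difference between normalize='all' and normalize='index' is only the denominator: the grand total in the first case, the row total in the second. The sketch below redoes the arithmetic by hand with the counts copied from the region/diet output above (the dictionary layout is chosen only for illustration):

```python
# Counts from the region/diet crosstab output,
# stored as (non vegetarian, vegetarian) pairs.
counts = {
    "-1": (0, 13), "Central": (0, 3), "East": (5, 26), "North": (5, 44),
    "North East": (13, 12), "South": (3, 56), "West": (3, 71),
}

grand_total = sum(sum(row) for row in counts.values())

# normalize='all': every cell divided by the grand total.
share_of_all = {region: tuple(v / grand_total for v in row)
                for region, row in counts.items()}

# normalize='index': every cell divided by its own row total.
share_of_row = {region: tuple(v / sum(row) for v in row)
                for region, row in counts.items()}

print(grand_total)                                # 254
print(round(share_of_all["North East"][0], 6))    # 0.051181
print(round(share_of_row["North East"][0], 2))    # 0.52
```

With row normalization each row sums to 1, which makes regions of different sizes directly comparable.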

Thursday, 31 March 2022

Pandas #04 - Data Aggregation

 AGGREGATE

Data Frame Aggregation

Pandas provides several methods to perform aggregations on data, often together with NumPy. DataFrame methods such as agg(), describe(), transform() and groupby() support data aggregation. Let us see how we can apply them:

import pandas as pd
import numpy as np

df = pd.DataFrame([[1, 2, 3, 4, 5],
                   [4, 5, 6, 7, 8],
                   [7, 8, 9, 10, 11],
                   [np.nan, np.nan, np.nan, np.nan, np.nan]],
                  columns=['A', 'B', 'C', 'D', 'E'])

# aggregate each column over its rows
dfagg = df.agg(['sum', 'min'])

print(dfagg)

"""
A B C D E
sum 12.0 15.0 18.0 21.0 24.0
min 1.0 2.0 3.0 4.0 5.0
"""

Applying different aggregate functions to different columns

# A gets sum and min, B gets min and max; combinations that were not requested show NaN

df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']})
print(df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']}))
"""
A B
sum 12.0 NaN
min 1.0 2.0
max NaN 8.0
"""

The describe() method displays statistical properties such as count, mean, std, min, max and the 25%, 50% and 75% quartiles. NaN values are ignored, so the all-NaN row of df is not counted.
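To see where these numbers come from, the statistics for a single column can be recomputed with the standard library. The sketch uses the values of column A from the DataFrame above (1, 4, 7 and one NaN) and assumes a sample standard deviation and linearly interpolated quartiles:

```python
import math
import statistics

# Values of column A, including the NaN from the last row.
column_a = [1.0, 4.0, 7.0, float("nan")]

# Drop the missing value before computing anything.
values = [v for v in column_a if not math.isnan(v)]

count = len(values)
mean = statistics.mean(values)
std = statistics.stdev(values)      # sample standard deviation (n - 1)
q25, q50, q75 = statistics.quantiles(values, n=4, method="inclusive")

print(count, mean, std)                     # 3 4.0 3.0
print(min(values), q25, q50, q75, max(values))   # 1.0 2.5 4.0 5.5 7.0
```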

print(df.describe())

"""
A B C D E
count 3.0 3.0 3.0 3.0 3.0
mean 4.0 5.0 6.0 7.0 8.0
std 3.0 3.0 3.0 3.0 3.0
min 1.0 2.0 3.0 4.0 5.0
25% 2.5 3.5 4.5 5.5 6.5
50% 4.0 5.0 6.0 7.0 8.0
75% 5.5 6.5 7.5 8.5 9.5
max 7.0 8.0 9.0 10.0 11.0
"""

Transformations on elements are very easy. Let us see some code snippets.

Let us assume we want to add 1 to all of the above elements.

print(df.transform(lambda x: x + 1))
"""
A B C D E
0 2.0 3.0 4.0 5.0 6.0
1 5.0 6.0 7.0 8.0 9.0
2 8.0 9.0 10.0 11.0 12.0
3 NaN NaN NaN NaN NaN
"""

We can use groupby() too.
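A group-wise transform differs from a plain aggregation: the sum is computed once per group and then written back on every row of that group, so the result has as many entries as the input. A plain-Python sketch of that idea, using a small sample of dated values (the variable names are illustrative):

```python
dates = ["2019-05-08", "2019-05-07", "2019-05-06", "2019-05-05"] * 2
data = [5, 8, 6, 1, 50, 100, 60, 120]

# Aggregate: one sum per distinct date.
group_sum = {}
for date, value in zip(dates, data):
    group_sum[date] = group_sum.get(date, 0) + value

# Transform: broadcast each group's sum back to its rows.
transformed = [group_sum[date] for date in dates]

print(len(group_sum))   # 4
print(transformed)      # [55, 108, 66, 121, 55, 108, 66, 121]
```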


df = pd.DataFrame({
    "Date": [
        "2019-05-08", "2019-05-07", "2019-05-06", "2019-05-05",
        "2019-05-08", "2019-05-07", "2019-05-06", "2019-05-05"],
    "Data": [5, 8, 6, 1, 50, 100, 60, 120],
})
print(df)
"""
Date Data
0 2019-05-08 5
1 2019-05-07 8
2 2019-05-06 6
3 2019-05-05 1
4 2019-05-08 50
5 2019-05-07 100
6 2019-05-06 60
7 2019-05-05 120
"""

print(df.groupby('Date')['Data'].transform('sum'))

"""
0 55
1 108
2 66
3 121
4 55
5 108
6 66
7 121
Name: Data, dtype: int64
"""

We can group by different levels of a hierarchical index using the level parameter. Please note the usage of the pd.MultiIndex.from_arrays() method. A MultiIndex can be grouped by level number or by level name.
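Grouping by a level means that only one position of each index tuple is used as the group key. A plain-Python sketch of that idea, with hypothetical (gender, type) keys and a helper named mean_by_level that is not a pandas function:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical two-level index: (gender, type) -> value.
rows = {
    ("male", "young"): 390.0, ("male", "old"): 350.0,
    ("female", "young"): 30.0, ("female", "old"): 20.0,
}

def mean_by_level(rows, level):
    """Average the values that share the same label at the given level."""
    groups = defaultdict(list)
    for key, value in rows.items():
        groups[key[level]].append(value)
    return {label: mean(vals) for label, vals in sorted(groups.items())}

print(mean_by_level(rows, 0))   # {'female': 25.0, 'male': 370.0}
print(mean_by_level(rows, 1))   # {'old': 185.0, 'young': 210.0}
```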



# Assume the arrays are like these (for illustrating the concept)
arrays = [['male', 'male', 'female', 'female'],
          ['young', 'old', 'young', 'old']]

index = pd.MultiIndex.from_arrays(arrays, names=('Gender', 'Type'))
df = pd.DataFrame({'Max Enthu': [390., 350., 30., 20.]},
                  index=index)

print(df)

"""
Max Enthu
Gender Type
male young 390.0
old 350.0
female young 30.0
old 20.0
"""

# using level 0
print(df.groupby(level=0).mean())
"""
Max Enthu
Gender
female 25.0
male 370.0

"""

# using the level name 'Type'
print(df.groupby(level="Type").mean())
"""
Max Enthu
Type
old 185.0
young 210.0
"""

print(df.groupby(level="Gender").mean())
"""
Max Enthu
Gender
female 25.0
male 370.0
"""

Happy Learning at AMET!!!
























