PIVOT TABLE
A pivot table is a similar operation that is commonly seen in spreadsheets and other programs that operate on tabular data.
The pivot table takes simple column-wise data as input, and groups the entries into a two-dimensional table that provides a multidimensional summarization of the data.
The difference between pivot tables and GroupBy
can sometimes cause confusion; it helps me to think of pivot tables as essentially a multidimensional version of GroupBy
aggregation.
That is, you split-apply-combine, but both the split and the combine happen across not a one-dimensional index, but across a two-dimensional grid.
We will take one sample dataset birth.csv as below:
import pandas as pd
births = pd.read_csv('https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv ')
print(births)
"""
year month day gender births
0 1969 1 1.0 F 4046
1 1969 1 1.0 M 4440
2 1969 1 2.0 F 4454
3 1969 1 2.0 M 4548
4 1969 1 3.0 F 4548
... ... ... ... ... ...
15542 2008 10 NaN M 183219
15543 2008 11 NaN F 158939
15544 2008 11 NaN M 165468
15545 2008 12 NaN F 173215
15546 2008 12 NaN M 181235
[15547 rows x 5 columns]
"""
How to count male and female in a decade. So now we have to create column say decade and groupby Gender, and sum those rows.
births['decade'] = 10 * (births['year'] // 10) # function
births.pivot_table('births', index='decade', columns='gender', aggfunc='sum')
print(births.pivot_table('births', index='decade', columns='gender', aggfunc='sum'))
"""
gender F M
decade
1960 1753634 1846572
1970 16263075 17121550
1980 18310351 19243452
1990 19479454 20420553
2000 18229309 19106428
"""
Have a deep breath. Look at the output. very easily found that male births outnumbered female births in all decades. How we will plot to explore visually
Cross Tab ( Contingency Table)
A contingency table is a tabular representation of categorical data . A contingency table usually shows frequencies for particular combinations of values of two discrete random variable s X and Y. Each cell in the table represents a mutually exclusive combination of X-Y values.
Used to summarize large data set.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('indian_food.csv') # From Kaggle
# print(df)
"""
name ... region
0 Balu shahi ... East
1 Boondi ... West
2 Gajar ka halwa ... North
3 Ghevar ... West
4 Gulab jamun ... East
.. ... ... ...
250 Til Pitha ... North East
251 Bebinca ... West
252 Shufta ... North
253 Mawa Bati ... Central
254 Pinaca ... West
[255 rows x 9 columns]
"""
#print(df.describe())
#print(df.shape)
#print(df.columns)
# Cross tab
# Compute a simple cross-tabulation of two (or more) factors.
# By default computes a frequency table of the factors unless an array of values and
# an aggregation function are passed.
# implementing crostab on state & diet columns
print(pd.crosstab(df['state'], df['diet']))
"""
diet non vegetarian vegetarian
state
-1 0 24
Andhra Pradesh 0 10
Assam 10 11
Bihar 0 3
Chhattisgarh 0 1
Goa 1 2
Gujarat 0 35
Haryana 0 1
Jammu & Kashmir 0 2
Karnataka 0 6
Kerala 1 7
Madhya Pradesh 0 2
Maharashtra 2 28
Manipur 1 1
NCT of Delhi 1 0
Nagaland 1 0
Odisha 0 7
Punjab 4 28
Rajasthan 0 6
Tamil Nadu 1 19
Telangana 1 4
Tripura 1 0
Uttar Pradesh 0 9
Uttarakhand 0 1
West Bengal 5 19
"""
print(pd.crosstab(df['region'], df['diet']))
"""
diet non vegetarian vegetarian
region
-1 0 13
Central 0 3
East 5 26
North 5 44
North East 13 12
South 3 56
West 3 71
"""
print(pd.crosstab(df['region'], df['diet'], normalize='all')) # Note Normalization
"""
diet non vegetarian vegetarian
region
-1 0.000000 0.051181
Central 0.000000 0.011811
East 0.019685 0.102362
North 0.019685 0.173228
North East 0.051181 0.047244
South 0.011811 0.220472
West 0.011811 0.279528
"""
print(pd.crosstab(df['region'], df['diet'], normalize='index')) # index normalization
# Plotting
pd.crosstab(df['region'], df['diet']).plot(kind='line')
plt.savefig('regionline.png')
plt.show()
plt.show()
pd.crosstab(df['region'], df['diet']).plot(kind='bar')
plt.savefig('regionbar.png')
plt.show()
pd.crosstab(df['region'], df['diet']).plot(kind='barh')
plt.savefig('regionbarh.png')
plt.show()
print(pd.crosstab(df['flavor_profile'], df['diet']).count)
"""
<bound method DataFrame.count of diet non vegetarian vegetarian
flavor_profile
-1 3 26
bitter 0 4
sour 0 1
spicy 26 107
sweet 0 88>
"""
crosstab has many uses. Here we discussed about very popular usage of crosstab.
No comments:
Post a Comment