PIVOT TABLE

A pivot table is a similar operation that is commonly seen in spreadsheets and other programs that operate on tabular data.

The pivot table takes simple column-wise data as input, and groups the entries into a two-dimensional table that provides a multidimensional summarization of the data.

The difference between pivot tables and GroupBy can sometimes cause confusion; it helps me to think of pivot tables as essentially a multidimensional version of GroupBy aggregation.

That is, you split-apply-combine, but both the split and the combine happen across not a one-dimensional index, but across a two-dimensional grid.

We will take one sample dataset birth.csv as below:

import pandas  as pd

births = pd.read_csv('https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv ')

print(births)
"""
      year  month  day gender  births
0      1969      1  1.0      F    4046
1      1969      1  1.0      M    4440
2      1969      1  2.0      F    4454
3      1969      1  2.0      M    4548
4      1969      1  3.0      F    4548
...     ...    ...  ...    ...     ...
15542  2008     10  NaN      M  183219
15543  2008     11  NaN      F  158939
15544  2008     11  NaN      M  165468
15545  2008     12  NaN      F  173215
15546  2008     12  NaN      M  181235

[15547 rows x 5 columns]

"""

How to count male and female in a decade. So now we have to create column say decade and groupby Gender, and sum those rows.


births['decade'] = 10 * (births['year'] // 10)  # function
births.pivot_table('births', index='decade', columns='gender', aggfunc='sum')
print(births.pivot_table('births', index='decade', columns='gender', aggfunc='sum'))
"""
gender         F         M
decade                    
1960     1753634   1846572
1970    16263075  17121550
1980    18310351  19243452
1990    19479454  20420553
2000    18229309  19106428

"""

Have a deep breath. Look at the output. very easily found that male births outnumbered female births in all decades. How we will plot to explore visually

Cross Tab ( Contingency Table)

A contingency table is a tabular representation of categorical data . A contingency table usually shows frequencies for particular combinations of values of two discrete random variable s X and Y. Each cell in the table represents a mutually exclusive combination of X-Y values.

Used to summarize large data set.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('indian_food.csv') # From Kaggle
# print(df)
"""
               name  ...      region
0        Balu shahi  ...        East
1            Boondi  ...        West
2    Gajar ka halwa  ...       North
3            Ghevar  ...        West
4       Gulab jamun  ...        East
..              ...  ...         ...
250       Til Pitha  ...  North East
251         Bebinca  ...        West
252          Shufta  ...       North
253       Mawa Bati  ...     Central
254          Pinaca  ...        West

[255 rows x 9 columns]

"""
#print(df.describe())
#print(df.shape)
#print(df.columns)


# Cross tab
# Compute a simple cross-tabulation of two (or more) factors.
# By default computes a frequency table of the factors unless an array of values and
# an aggregation function are passed.
# implementing crostab on state & diet columns

print(pd.crosstab(df['state'], df['diet']))
"""
diet             non vegetarian  vegetarian
state                                      
-1                            0          24
Andhra Pradesh                0          10
Assam                        10          11
Bihar                         0           3
Chhattisgarh                  0           1
Goa                           1           2
Gujarat                       0          35
Haryana                       0           1
Jammu & Kashmir               0           2
Karnataka                     0           6
Kerala                        1           7
Madhya Pradesh                0           2
Maharashtra                   2          28
Manipur                       1           1
NCT of Delhi                  1           0
Nagaland                      1           0
Odisha                        0           7
Punjab                        4          28
Rajasthan                     0           6
Tamil Nadu                    1          19
Telangana                     1           4
Tripura                       1           0
Uttar Pradesh                 0           9
Uttarakhand                   0           1
West Bengal                   5          19
"""



print(pd.crosstab(df['region'], df['diet']))
"""
diet        non vegetarian  vegetarian
region                                
-1                       0          13
Central                  0           3
East                     5          26
North                    5          44
North East              13          12
South                    3          56
West                     3          71
"""

print(pd.crosstab(df['region'], df['diet'], normalize='all'))  # Note Normalization

"""
diet        non vegetarian  vegetarian
region                                
-1                0.000000    0.051181
Central           0.000000    0.011811
East              0.019685    0.102362
North             0.019685    0.173228
North East        0.051181    0.047244
South             0.011811    0.220472
West              0.011811    0.279528
"""

print(pd.crosstab(df['region'], df['diet'], normalize='index')) # index normalization

# Plotting
pd.crosstab(df['region'], df['diet']).plot(kind='line')


plt.savefig('regionline.png')

plt.show()

plt.show()

pd.crosstab(df['region'], df['diet']).plot(kind='bar')
plt.savefig('regionbar.png')
plt.show()



pd.crosstab(df['region'], df['diet']).plot(kind='barh')
plt.savefig('regionbarh.png')
plt.show()


print(pd.crosstab(df['flavor_profile'], df['diet']).count)
"""
<bound method DataFrame.count of diet            non vegetarian  vegetarian
flavor_profile
-1                           3          26
bitter                       0           4
sour                         0           1
spicy                       26         107
sweet                        0          88>
"""

crosstab has many uses. Here we discussed about very popular usage of crosstab.

AMET-SOLID

Friday, 1 April 2022

Pandas #05 Pivot Tables

PIVOT TABLE

Cross Tab ( Contingency Table)

No comments:

Post a Comment

Making Prompts for Profile Web Site

Happy open and Distance Learning!

Blog Archive