AGGREGATE

Data Frame Aggregation

pandas provides several methods for aggregating data, often in combination with NumPy. The DataFrame agg() method applies one or more aggregation functions to each column. Let us see how to apply it:

import pandas as pd
import numpy as np

df = pd.DataFrame([[1, 2, 3, 4, 5],
                   [4, 5, 6, 7, 8],
                   [7, 8, 9, 10, 11],
                   [np.nan, np.nan, np.nan, np.nan, np.nan]],
                  columns=['A', 'B', 'C', 'D', 'E'])

# Aggregate down the rows: one sum and one min per column (NaN values are skipped)
dfagg = df.agg(['sum', 'min'])

print(dfagg)

"""
       A     B     C     D     E
sum  12.0  15.0  18.0  21.0  24.0
min   1.0   2.0   3.0   4.0   5.0
"""
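
The column sums above ignore the all-NaN row because pandas reductions skip missing values by default. As a small sketch on the same sample data (the variable names col_sums, strict_sums, and row_sums are illustrative and not part of the original example), the following compares the default behaviour with skipna=False and with aggregating across columns instead of down the rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5],
                   [4, 5, 6, 7, 8],
                   [7, 8, 9, 10, 11],
                   [np.nan, np.nan, np.nan, np.nan, np.nan]],
                  columns=['A', 'B', 'C', 'D', 'E'])

# Default: NaN values are skipped, so each column sums its three real values.
col_sums = df.agg('sum')

# skipna=False lets NaN propagate, so every column sum becomes NaN.
strict_sums = df.sum(skipna=False)

# axis='columns' aggregates across each row instead of down each column.
row_sums = df.agg('sum', axis='columns')

print(col_sums)
print(strict_sums)
print(row_sums)
```

With the default settings the all-NaN row sums to 0.0 rather than NaN. Passing min_count=1 to sum() is one way to get NaN back for such rows.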

Applying different aggregations to different columns

# Use a different set of aggregate functions for each column

print(df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']}))
"""
       A    B
sum  12.0  NaN
min   1.0  2.0
max   NaN  8.0
"""
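
The NaN cells in the output above appear because each function is computed only for the columns that asked for it. When every column needs just one function, a dictionary with a single function name per column avoids the sparse table. This is a minimal sketch on the same data; the name flat is illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5],
                   [4, 5, 6, 7, 8],
                   [7, 8, 9, 10, 11],
                   [np.nan, np.nan, np.nan, np.nan, np.nan]],
                  columns=['A', 'B', 'C', 'D', 'E'])

# One function per column returns a Series indexed by column name,
# so there are no NaN placeholders for "function not requested".
flat = df.agg({'A': 'sum', 'B': 'max'})
print(flat)
```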

The describe() method displays summary statistics for each numeric column: count, mean, std, min, max, and the 25%, 50%, and 75% percentiles.

print(df.describe())

"""
        A    B    C     D     E
count  3.0  3.0  3.0   3.0   3.0
mean   4.0  5.0  6.0   7.0   8.0
std    3.0  3.0  3.0   3.0   3.0
min    1.0  2.0  3.0   4.0   5.0
25%    2.5  3.5  4.5   5.5   6.5
50%    4.0  5.0  6.0   7.0   8.0
75%    5.5  6.5  7.5   8.5   9.5
max    7.0  8.0  9.0  10.0  11.0
"""
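
The quartiles shown by describe() are only the defaults. As a hedged sketch, the percentiles parameter can request other cut points. The values 0.1 and 0.9 below are arbitrary choices for illustration, and the name summary is not from the original post:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5],
                   [4, 5, 6, 7, 8],
                   [7, 8, 9, 10, 11],
                   [np.nan, np.nan, np.nan, np.nan, np.nan]],
                  columns=['A', 'B', 'C', 'D', 'E'])

# Ask for the 10th and 90th percentiles instead of the default quartiles.
# NaN values are excluded, so count stays at 3 for every column.
summary = df.describe(percentiles=[0.1, 0.9])
print(summary)
```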

Transforming and manipulating elements is just as easy. Let us look at some code snippets.

Suppose we want to add 1 to every element of the DataFrame above.

print(df.transform(lambda x: x + 1))
"""
     A    B     C     D     E
0  2.0  3.0   4.0   5.0   6.0
1  5.0  6.0   7.0   8.0   9.0
2  8.0  9.0  10.0  11.0  12.0
3  NaN  NaN   NaN   NaN   NaN
"""
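
Unlike agg(), transform() returns a result with the same shape as its input, which makes it a natural fit for rescaling. The sketch below standardizes each column to z-scores. This is an illustrative use that the original post does not cover, and the name standardized is hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5],
                   [4, 5, 6, 7, 8],
                   [7, 8, 9, 10, 11],
                   [np.nan, np.nan, np.nan, np.nan, np.nan]],
                  columns=['A', 'B', 'C', 'D', 'E'])

# The lambda receives one column at a time as a Series.
# Subtract the column mean and divide by the column standard deviation.
standardized = df.transform(lambda col: (col - col.mean()) / col.std())
print(standardized)
```

Every column here has mean and std computed from its three non-missing values, and the all-NaN row stays NaN in the result.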

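We can combine transform() with groupby() too. Each row receives the sum of its group, so the result keeps the original length.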


df = pd.DataFrame({
    "Date": [
        "2019-05-08", "2019-05-07", "2019-05-06", "2019-05-05",
        "2019-05-08", "2019-05-07", "2019-05-06", "2019-05-05"],
    "Data": [5, 8, 6, 1, 50, 100, 60, 120],
})
print(df)
"""
        Date  Data
0  2019-05-08     5
1  2019-05-07     8
2  2019-05-06     6
3  2019-05-05     1
4  2019-05-08    50
5  2019-05-07   100
6  2019-05-06    60
7  2019-05-05   120
"""

print(df.groupby('Date')['Data'].transform('sum'))

"""
0     55
1    108
2     66
3    121
4     55
5    108
6     66
7    121
Name: Data, dtype: int64
"""
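
To make the difference between aggregating and transforming a group concrete, here is a minimal sketch on the same Date/Data frame. The column names DailyTotal and Share are assumptions added for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": [
        "2019-05-08", "2019-05-07", "2019-05-06", "2019-05-05",
        "2019-05-08", "2019-05-07", "2019-05-06", "2019-05-05"],
    "Data": [5, 8, 6, 1, 50, 100, 60, 120],
})

# Aggregation: one row per date (4 rows), indexed by Date.
daily_total = df.groupby("Date")["Data"].sum()

# Transformation: the group total is broadcast back to all 8 original rows,
# so it can be stored as a new column and combined with the raw values.
df["DailyTotal"] = df.groupby("Date")["Data"].transform("sum")
df["Share"] = df["Data"] / df["DailyTotal"]

print(daily_total)
print(df)
```

Because the transformed result is aligned with the original index, each row's share of its daily total can be computed with a plain column division.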

We can group by different levels of a hierarchical index using the level parameter. Note the use of the pd.MultiIndex.from_arrays() method to build the index. A MultiIndex can be grouped either by level number or by level name.


# Assume the arrays look like this (for illustration only)
arrays = [['male', 'male', 'female', 'female'],
          ['young', 'old', 'young', 'old']]


index = pd.MultiIndex.from_arrays(arrays, names=('Gender', 'Type'))
df = pd.DataFrame({'Max Enthu': [390., 350., 30., 20.]},
                  index=index)

print(df)


"""
              Max Enthu
Gender Type            
male   young      390.0
       old        350.0
female young       30.0
       old         20.0
"""


# Group by level number: level 0 is Gender
print(df.groupby(level=0).mean())
"""
        Max Enthu
Gender           
female       25.0
male        370.0
"""


# Group by level name: Type
print(df.groupby(level="Type").mean())
"""
       Max Enthu
Type            
old        185.0
young      210.0
"""


# Grouping by the level name "Gender" gives the same result as level=0
print(df.groupby(level="Gender").mean())
"""
       Max Enthu
Gender           
female       25.0
male        370.0
"""
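
Level-based grouping also works with agg(), so several statistics can be computed per group in one call. This is a small sketch on the same MultiIndex frame, and the name summary is illustrative:

```python
import pandas as pd

arrays = [['male', 'male', 'female', 'female'],
          ['young', 'old', 'young', 'old']]
index = pd.MultiIndex.from_arrays(arrays, names=('Gender', 'Type'))
df = pd.DataFrame({'Max Enthu': [390., 350., 30., 20.]}, index=index)

# Group by the Gender level, then compute the mean and max of each group.
# The result has one row per gender and one column per statistic.
summary = df.groupby(level="Gender")["Max Enthu"].agg(["mean", "max"])
print(summary)
```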

Happy Learning at AMET!!!

AMET-SOLID

Thursday, 31 March 2022

Pandas #04 - Data Aggregation
