Thursday 31 March 2022

Pandas #04 - Data Aggregation


Data Frame Aggregation

The pandas library, often used together with NumPy, offers several methods for aggregating data. The DataFrame.agg() method applies one or more aggregation functions to a DataFrame. Let us see how to use it:

import pandas as pd
import numpy as np

df = pd.DataFrame(
    [[1, 2, 3, 4, 5],
     [4, 5, 6, 7, 8],
     [7, 8, 9, 10, 11],
     [np.nan, np.nan, np.nan, np.nan, np.nan]],
    columns=['A', 'B', 'C', 'D', 'E'])

# Aggregate over rows: each column is reduced to its sum and its minimum
dfagg = df.agg(['sum', 'min'])

print(dfagg)

"""
        A     B     C     D     E
sum  12.0  15.0  18.0  21.0  24.0
min   1.0   2.0   3.0   4.0   5.0
"""

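The call above reduces each column, which is the default (axis=0). As a minimal sketch, assuming the same frame as above, passing axis=1 reduces each row instead. The name row_sums is only illustrative. Note that sum skips NaN by default, so the all-NaN row sums to 0.0 rather than NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4, 5],
     [4, 5, 6, 7, 8],
     [7, 8, 9, 10, 11],
     [np.nan] * 5],
    columns=["A", "B", "C", "D", "E"],
)

# axis=1 aggregates across the columns of each row
row_sums = df.agg("sum", axis=1)
print(row_sums)  # 15.0, 30.0, 45.0 and 0.0 (NaN values are skipped)
```
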
Applying different aggregate functions to different columns

# A dict maps each column to its own list of aggregate functions

print(df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']}))
"""
        A    B
sum  12.0  NaN
min   1.0  2.0
max   NaN  8.0
"""

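The NaN entries above appear because a function was not requested for that column. When each column needs only one function, the dict values can be plain strings and the result is a Series without that padding. A small sketch, assuming the same frame (the name per_column is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4, 5],
     [4, 5, 6, 7, 8],
     [7, 8, 9, 10, 11],
     [np.nan] * 5],
    columns=["A", "B", "C", "D", "E"],
)

# One function per column: the result is a Series indexed by column name
per_column = df.agg({"A": "sum", "B": "max"})
print(per_column)  # A is 12.0, B is 8.0
```
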
The describe() method displays summary statistics for each column: count, mean, std, min, max, and the 25%, 50%, and 75% percentiles.

print(df.describe())

"""
         A    B    C     D     E
count  3.0  3.0  3.0   3.0   3.0
mean   4.0  5.0  6.0   7.0   8.0
std    3.0  3.0  3.0   3.0   3.0
min    1.0  2.0  3.0   4.0   5.0
25%    2.5  3.5  4.5   5.5   6.5
50%    4.0  5.0  6.0   7.0   8.0
75%    5.5  6.5  7.5   8.5   9.5
max    7.0  8.0  9.0  10.0  11.0
"""

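Note that count is 3 although the frame has 4 rows, because describe() ignores NaN values. The reported percentiles can be changed with the percentiles parameter. A small sketch, assuming the same frame (the chosen percentiles and the name summary are only illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4, 5],
     [4, 5, 6, 7, 8],
     [7, 8, 9, 10, 11],
     [np.nan] * 5],
    columns=["A", "B", "C", "D", "E"],
)

# Ask for the 10th and 90th percentiles instead of the default quartiles
summary = df.describe(percentiles=[0.1, 0.9])
print(summary)
```
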
Transforming and manipulating elements is just as easy. Let us look at some code snippets.

Suppose we want to add 1 to every element of the DataFrame above.

print(df.transform(lambda x: x + 1))
"""
     A    B     C     D     E
0  2.0  3.0   4.0   5.0   6.0
1  5.0  6.0   7.0   8.0   9.0
2  8.0  9.0  10.0  11.0  12.0
3  NaN  NaN   NaN   NaN   NaN
"""

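Unlike agg(), transform() returns a result with the same shape as its input. As a slightly larger sketch, assuming the same frame, the function below standardizes each column by subtracting its mean and dividing by its standard deviation. The name scaled is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4, 5],
     [4, 5, 6, 7, 8],
     [7, 8, 9, 10, 11],
     [np.nan] * 5],
    columns=["A", "B", "C", "D", "E"],
)

# Standardize every column; NaN entries stay NaN
scaled = df.transform(lambda col: (col - col.mean()) / col.std())
print(scaled)
print(scaled.shape == df.shape)  # the shape is preserved
```
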
We can use groupby() with transform() too: each row receives the aggregate of the group it belongs to.


df = pd.DataFrame({
    "Date": [
        "2019-05-08", "2019-05-07", "2019-05-06", "2019-05-05",
        "2019-05-08", "2019-05-07", "2019-05-06", "2019-05-05"],
    "Data": [5, 8, 6, 1, 50, 100, 60, 120],
})
print(df)
"""
         Date  Data
0  2019-05-08     5
1  2019-05-07     8
2  2019-05-06     6
3  2019-05-05     1
4  2019-05-08    50
5  2019-05-07   100
6  2019-05-06    60
7  2019-05-05   120
"""

print(df.groupby('Date')['Data'].transform('sum'))

"""
0 55
1 108
2 66
3 121
4 55
5 108
6 66
7 121
Name: Data, dtype: int64
"""

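The difference from a plain aggregation is that transform('sum') keeps one value per original row, while sum() returns one value per group. A minimal sketch, assuming the same data, uses this to compute each row's share of its daily total. The column name Share and the variable daily_total are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": [
        "2019-05-08", "2019-05-07", "2019-05-06", "2019-05-05",
        "2019-05-08", "2019-05-07", "2019-05-06", "2019-05-05"],
    "Data": [5, 8, 6, 1, 50, 100, 60, 120],
})

# Aggregation: one row per date
daily_total = df.groupby("Date")["Data"].sum()
print(daily_total)

# Transformation: the group total is broadcast back to every row,
# so it can be combined directly with the original column
df["Share"] = df["Data"] / df.groupby("Date")["Data"].transform("sum")
print(df)
```
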
We can group by different levels of a hierarchical index using the level parameter. Note the use of the pd.MultiIndex.from_arrays() method to build the index. A MultiIndex can be grouped by level number or by level name.



# Sample arrays, for illustration only
arrays = [['male', 'male', 'female', 'female'],
          ['young', 'old', 'young', 'old']]

index = pd.MultiIndex.from_arrays(arrays, names=('Gender', 'Type'))
df = pd.DataFrame({'Max Enthu': [390., 350., 30., 20.]},
                  index=index)

print(df)

"""
               Max Enthu
Gender Type
male   young       390.0
       old         350.0
female young        30.0
       old          20.0
"""

# Group by level number: level 0 is Gender
print(df.groupby(level=0).mean())
"""
        Max Enthu
Gender
female       25.0
male        370.0
"""

# Group by level name: Type
print(df.groupby(level="Type").mean())
"""
       Max Enthu
Type
old        185.0
young      210.0
"""

# Group by level name: Gender (same result as level=0)
print(df.groupby(level="Gender").mean())
"""
        Max Enthu
Gender
female       25.0
male        370.0
"""

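Grouping by level combines naturally with agg() from the start of this post. A small sketch, assuming the same MultiIndex frame, computes both the mean and the maximum per gender in one call. The name stats is illustrative.

```python
import pandas as pd

arrays = [["male", "male", "female", "female"],
          ["young", "old", "young", "old"]]
index = pd.MultiIndex.from_arrays(arrays, names=("Gender", "Type"))
df = pd.DataFrame({"Max Enthu": [390.0, 350.0, 30.0, 20.0]}, index=index)

# Several aggregates per group; the result has two-level column labels
stats = df.groupby(level="Gender").agg(["mean", "max"])
print(stats)
```
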
Happy Learning at AMET!!!
























