Thursday, 10 March 2022

Python#09

Data Cleaning

Real-world data usually arrives with missing values, null values, incorrect values, and inappropriate values.

The biggest problem is missing values. They are very common in real data.

How can we handle those values in Python? Let us see.


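# import the pandas library

import pandas as pd
import numpy as np

# Build a 5x3 frame of random numbers with five row labels.
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])

# Labels b, d and g are not in the original index, so their rows become NaN.
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print(df)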

Result:

"""
       one       two     three
a  0.335319 -0.298568 -2.062935
b       NaN       NaN       NaN
c -1.739043 -0.912386 -0.675446
d       NaN       NaN       NaN
e -0.462957 -1.445715  1.483821
f  0.901405 -1.162616  0.173550
g       NaN       NaN       NaN
h -0.736636  1.685347  1.091092
"""

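In the above data frame, rows b, d, and g show NaN (Not a Number), the marker pandas uses for missing values. They appear because reindex() added labels that were not in the original index.

To check which values are missing, use the isnull() function.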
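One thing worth knowing about NaN is that it never compares equal to anything, not even to itself, so a test with == will not find it. The short sketch below uses a small made-up Series (the values are only for illustration) to compare an == check with isna().

```python
import numpy as np
import pandas as pd

# Hypothetical three-value series with one missing entry.
s = pd.Series([1.0, np.nan, 3.0])

# Comparing with == never matches NaN, so every result is False.
eq_check = (s == np.nan)
print(eq_check.tolist())

# isna() (same as isnull()) is the reliable way to find missing values.
isna_check = s.isna()
print(isna_check.tolist())
```

This is why the pandas functions isnull() and isna() are used for detection rather than a plain comparison.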


import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
                                                'h'], columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print(df['one'].isnull())

Result:

"""
a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
"""
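Before cleaning, it often helps to count how many values are missing. The following is a minimal sketch, assuming a small hand-written data frame (the numbers are made up for illustration) so that the counts are easy to check by eye.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with known gaps.
df = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0, np.nan],
     "two": [np.nan, 2.0, 3.0, 4.0],
     "three": [1.0, 2.0, 3.0, 4.0]},
    index=["a", "b", "c", "d"],
)

# isnull() gives True/False per cell; sum() counts the True values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# any(axis=1) flags every row that has at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)

# Summing twice gives the total number of missing cells.
total_missing = int(df.isnull().sum().sum())
print("Total missing cells:", total_missing)
```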

OK. Now we know the problem. How do we rectify it and clean the data?

Replace NaN with 0.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],
                  columns=['one', 'two', 'three'])
# Label b is not in the original index, so row b is filled with NaN.
df = df.reindex(['a', 'b', 'c'])
print(df)
print("NaN replaced with '0':")
# fillna(0) returns a new frame; df itself is not changed.
print(df.fillna(0))

Result:

"""
       one       two     three
a  0.373935 -1.487100 -0.272034
b       NaN       NaN       NaN
c  0.686059  0.286542 -0.093683
NaN replaced with '0':
       one       two     three
a  0.373935 -1.487100 -0.272034
b  0.000000  0.000000  0.000000
c  0.686059  0.286542 -0.093683
"""
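Zero is not always a sensible replacement. As a sketch of two alternatives, the example below fills each column with its own mean, and then with a different fixed value per column. The data and the fill values 0 and -1 are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame where row b is missing in both columns.
df = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0], "two": [10.0, np.nan, 30.0]},
    index=["a", "b", "c"],
)

# df.mean() skips NaN by default; passing the result to fillna()
# fills every column with its own mean.
filled_mean = df.fillna(df.mean())
print(filled_mean)

# A dict maps column name -> fill value for that column.
filled_custom = df.fillna({"one": 0, "two": -1})
print(filled_custom)
```

Filling with the mean keeps the column average unchanged, while a fixed value such as 0 can pull it up or down.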

Next, we fill each NaN with the last valid value above it. This is called forward fill, also known as 'pad', as shown in the Python script below.


import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# ffill() is the forward fill ('pad'); fillna(method='pad') is deprecated in recent pandas.
print(df.ffill())

Result:


"""
       one       two     three
a -1.764189  1.336129  0.512163
b -1.764189  1.336129  0.512163
c  1.495126 -0.165035 -1.719821
d  1.495126 -0.165035 -1.719821
e  1.273926  0.606101  1.416004
f  1.901047  1.813446 -0.263735
g  1.901047  1.813446 -0.263735
h -1.900605  0.052075 -2.418204

"""
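Forward fill has a mirror image, backward fill, and both accept a limit on how many consecutive gaps they fill. The sketch below assumes a small made-up column with two missing values in a row, so the difference between the three calls is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two consecutive gaps (rows b and c).
df = pd.DataFrame({"one": [1.0, np.nan, np.nan, 4.0]},
                  index=["a", "b", "c", "d"])

# Forward fill: copy the last valid value downwards.
forward = df.ffill()
print(forward)

# Backward fill: copy the next valid value upwards.
backward = df.bfill()
print(backward)

# limit=1 fills at most one consecutive NaN, so row c stays missing.
limited = df.ffill(limit=1)
print(limited)
```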

Rows with missing values can also be dropped, as in the following example.


import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df.dropna())

Result:


"""
        one       two     three
a  1.177113 -0.471903 -0.779807
c -0.917548 -0.478030  0.128027
e -1.579338  0.950953 -2.017034
f -0.050153 -0.419798 -0.007029
h  1.207687 -1.491949 -0.895676
"""

Compared with the reindexed data frame, rows b, d, and g have been dropped because they contain NaN.

The replace() function is similar to fillna(), but it swaps any given value for another, which is useful for correcting incorrect values, as shown below:
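By default dropna() removes a row if any value in it is missing, which can throw away a lot of data. The sketch below shows a few of its options on a small made-up data frame: how='all', subset, and thresh. The data is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row a is complete, row b has one gap, row c is empty.
df = pd.DataFrame(
    {"one": [1.0, np.nan, np.nan],
     "two": [2.0, 5.0, np.nan],
     "three": [3.0, 6.0, np.nan]},
    index=["a", "b", "c"],
)

# Default: drop a row if ANY value is missing.
drop_any = df.dropna()

# how='all': drop a row only if ALL values are missing.
drop_all = df.dropna(how="all")

# subset: look for missing values only in the listed columns.
drop_subset = df.dropna(subset=["two"])

# thresh: keep rows that have at least this many non-missing values.
drop_thresh = df.dropna(thresh=2)

print(drop_any, drop_all, drop_subset, drop_thresh, sep="\n\n")
```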


import pandas as pd
import numpy as np
df = pd.DataFrame({'one':[10,20,30,40,50,2000],
'two':[1000,0,30,40,50,60]})

print(df.replace({1000:10,2000:60}))

Result:

"""
   one  two
0   10   10
1   20    0
2   30   30
3   40   40
4   50   50
5   60   60
"""
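Sometimes a data set marks missing values with a placeholder number instead of NaN. As a sketch of how replace() and fillna() can work together, the example below assumes a made-up placeholder of -999, turns it into NaN, and then fills the gaps with each column's median.

```python
import numpy as np
import pandas as pd

# Hypothetical frame where -999 stands for "no measurement".
df = pd.DataFrame({"one": [10, 20, -999, 40],
                   "two": [1, -999, 3, 4]})

# Step 1: turn the placeholder into a real missing value.
cleaned = df.replace(-999, np.nan)
print(cleaned)

# Step 2: fill each gap with the median of its column
# (median() ignores NaN by default).
filled = cleaned.fillna(cleaned.median())
print(filled)
```

Converting to NaN first matters here: if -999 were left in place, it would distort the median used for filling.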

Hopefully the above examples make these functions clear.

Happy data cleaning, and enjoy learning with Python!

AMET-SOLID
