Thursday, 10 March 2022

Python#09

Data Cleaning

Real-world data usually arrives with missing values, null values, incorrect values, and inappropriate values.

The biggest problem is missing values. They are very common in real-world datasets.

How can we handle those values in Python? Let us see.


# import the pandas library

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print(df)



Result:

        one       two     three
a  0.335319 -0.298568 -2.062935
b       NaN       NaN       NaN
c -1.739043 -0.912386 -0.675446
d       NaN       NaN       NaN
e -0.462957 -1.445715  1.483821
f  0.901405 -1.162616  0.173550
g       NaN       NaN       NaN
h -0.736636  1.685347  1.091092

In the DataFrame above, rows b, d, and g were added by reindex() and hold NaN (Not a Number), which is how pandas marks a missing value.

Next, let us check which values are missing by using isnull().
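
One point worth keeping in mind: NaN does not compare equal to anything, not even to itself, so an ordinary == test cannot find it. The small sketch below illustrates this with a throwaway variable named value (a name chosen only for this example) and the pd.isna() helper.

```python
import numpy as np
import pandas as pd

value = np.nan

# NaN never compares equal, even to itself.
equal_to_itself = (value == value)
print(equal_to_itself)

# The dedicated check is the reliable way to detect it.
detected = pd.isna(value)
print(detected)
```

This is why the examples in this post rely on isnull() rather than on comparisons.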


import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'], columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print(df['one'].isnull())

Result: 

a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
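
As a small extension of the check above, here is a sketch that counts the missing values per column and lists the affected rows. The variable names missing_per_column and rows_with_missing are only illustrative.

```python
import numpy as np
import pandas as pd

# Same construction as above: reindexing adds rows b, d, g filled with NaN.
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# isnull() gives a True/False mask; sum() counts the True values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# any(axis=1) flags every row that has at least one missing value.
rows_with_missing = df.index[df.isnull().any(axis=1)].tolist()
print(rows_with_missing)
```

A quick count like this is often a useful first step before deciding whether to fill or drop.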

OK. Now we know the problem. How do we rectify it and clean the data?

The first option is to replace NaN with 0.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],columns=['one',
'two', 'three'])
df = df.reindex(['a', 'b', 'c'])
print(df)
print("C..NaN replaced with '0':")
print( df.fillna(0))

Result:

        one       two     three
a  0.373935 -1.487100 -0.272034
b       NaN       NaN       NaN
c  0.686059  0.286542 -0.093683

C..NaN replaced with '0':

        one       two     three
a  0.373935 -1.487100 -0.272034
b  0.000000  0.000000  0.000000
c  0.686059  0.286542 -0.093683
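
Zero is not always a sensible stand-in. As one possible alternative (not something the example above requires), the sketch below fills each gap with the mean of its own column. The small DataFrame and the name filled are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, np.nan, 3.0], 'two': [10.0, np.nan, 20.0]},
                  index=['a', 'b', 'c'])

# df.mean() skips NaN by default and returns one mean per column;
# fillna() then fills each column's gaps with that column's mean.
filled = df.fillna(df.mean())
print(filled)
```

Whether 0, the mean, or some other value is appropriate depends on what the column represents.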

Now we fill each NaN with the last valid value above it (a forward fill, which pandas calls 'pad'), as shown in the Python script below.


import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

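# Forward fill; gives the same result as the older fillna(method='pad'),
# which recent pandas versions have deprecated.
print(df.ffill())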

Result:

"""
        one       two     three
a -1.764189  1.336129  0.512163
b -1.764189  1.336129  0.512163
c  1.495126 -0.165035 -1.719821
d  1.495126 -0.165035 -1.719821
e  1.273926  0.606101  1.416004
f  1.901047  1.813446 -0.263735
g  1.901047  1.813446 -0.263735
h -1.900605  0.052075 -2.418204
"""
Next, drop the rows that contain missing values, as in the following example.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df.dropna())

Result:

"""
one two three
a 1.177113 -0.471903 -0.779807
c -0.917548 -0.478030 0.128027
e -1.579338 0.950953 -2.017034
f -0.050153 -0.419798 -0.007029
h 1.207687 -1.491949 -0.895676
"""
Comparing this output with the reindexed DataFrame, we can clearly see that rows b, d, and g, which contained NaN, have been dropped.
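
By default dropna() removes a row if any value in it is missing. The sketch below, built on a small made-up DataFrame, shows two gentler variants using the how and subset parameters. The names any_missing, all_missing, and by_column are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, np.nan, 3.0, np.nan],
                   'two': [4.0, np.nan, np.nan, 7.0]},
                  index=['a', 'b', 'c', 'd'])

any_missing = df.dropna()              # default how='any': only row a survives
all_missing = df.dropna(how='all')     # drops only row b, where every value is NaN
by_column = df.dropna(subset=['one'])  # checks column 'one' only: keeps a and c

print(any_missing)
print(all_missing)
print(by_column)
```

Choosing among these depends on how much partially filled data we are willing to keep.
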
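Sometimes we need to replace a specific value, such as an incorrect entry, with another one. The replace() function does this with scalar values, much as fillna() does for NaN, as shown below: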

import pandas as pd
import numpy as np
df = pd.DataFrame({'one':[10,20,30,40,50,2000],
'two':[1000,0,30,40,50,60]})

print(df.replace({1000:10,2000:60}))

Result:

"""
one two
0 10 10
1 20 0
2 30 30
3 40 40
4 50 50
5 60 60
"""

Hopefully, the examples above make these functions clear.
Happy data cleaning, and enjoy learning with Python!
