Data Cleaning
Real-world data usually arrives with missing values, null values, incorrect values, and inappropriate values.
The major problem is missing values, which are very common in practice.
How can we handle those values in Python? Let us see.
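# import the pandas and numpy libraries
import pandas as pd
import numpy as np
# build a 5x3 data frame of random numbers
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
# reindex with extra labels; the new rows b, d, g have no data, so they hold NaN
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)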
Result:
one two three
a 0.335319 -0.298568 -2.062935
b NaN NaN NaN
c -1.739043 -0.912386 -0.675446
d NaN NaN NaN
e -0.462957 -1.445715 1.483821
f 0.901405 -1.162616 0.173550
g NaN NaN NaN
h -0.736636 1.685347 1.091092
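In the above data frame, we can see NaN (Not a Number) in rows b, d, and g. reindex() added those labels, and since they had no data, pandas marks them as missing.
Next, let us detect the missing values in one column with isnull().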
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df['one'].isnull())
Result:
a False
b True
c False
d True
e False
f False
g True
h False
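Beyond checking a single column, it is often useful to count the missing values in every column and list the affected rows. The following is a minimal sketch; the fixed numbers and the variable names missing_per_column and rows_with_missing are assumptions chosen so the result is reproducible.

```python
import pandas as pd

# Fixed values (an assumption for illustration) instead of random numbers.
df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]},
    index=["a", "c", "e"],
)
df = df.reindex(["a", "b", "c", "d", "e"])  # rows b and d become NaN

# isnull() gives a True/False frame; sum() counts the True values per column.
missing_per_column = df.isnull().sum()

# any(axis=1) is True for each row that has at least one NaN.
rows_with_missing = df[df.isnull().any(axis=1)].index.tolist()

print(missing_per_column)
print(rows_with_missing)
```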
Now we know the problem. How do we rectify it and clean the data?
One option is to replace NaN with 0.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],columns=['one',
'two', 'three'])
df = df.reindex(['a', 'b', 'c'])
print(df)
print("NaN replaced with '0':")
print(df.fillna(0))
Result:
one two three
a 0.373935 -1.487100 -0.272034
b NaN NaN NaN
c 0.686059 0.286542 -0.093683
NaN replaced with '0':
one two three
a 0.373935 -1.487100 -0.272034
b 0.000000 0.000000 0.000000
c 0.686059 0.286542 -0.093683
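Filling with 0 is not always a sensible choice, because it can distort averages. As a hedged alternative, the sketch below fills each column with its own mean; the sample numbers are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Small fixed frame (assumed values) with one missing row.
df = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0], "two": [10.0, np.nan, 30.0]},
    index=["a", "b", "c"],
)

# df.mean() skips NaN by default and returns one mean per column;
# fillna() then matches those means to the columns by name.
filled = df.fillna(df.mean())
print(filled)
```

Whether 0, the mean, or some other value is appropriate depends on what the column represents.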
Now we fill with 'pad' (forward fill): each NaN takes the last valid value above it, as shown in the Python script below.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# ffill() is the forward fill; older pandas versions wrote df.fillna(method='pad')
print(df.ffill())
Result:
"""
one two three
a -1.764189 1.336129 0.512163
b -1.764189 1.336129 0.512163
c 1.495126 -0.165035 -1.719821
d 1.495126 -0.165035 -1.719821
e 1.273926 0.606101 1.416004
f 1.901047 1.813446 -0.263735
g 1.901047 1.813446 -0.263735
h -1.900605 0.052075 -2.418204
"""
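Forward fill has two close relatives worth knowing: a limit on how many consecutive NaN values get filled, and backward fill. The sketch below uses a small Series with assumed values so the effect of each call is easy to follow.

```python
import numpy as np
import pandas as pd

# Assumed sample values: two consecutive gaps between a and d.
s = pd.Series([1.0, np.nan, np.nan, 4.0], index=["a", "b", "c", "d"])

forward = s.ffill()          # b and c take the value from a
limited = s.ffill(limit=1)   # only b is filled; c stays NaN
backward = s.bfill()         # b and c take the value from d

print(forward)
print(limited)
print(backward)
```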
Drop missing values with dropna(), as in the following example.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df.dropna())
Result:
"""
one two three
a 1.177113 -0.471903 -0.779807
c -0.917548 -0.478030 0.128027
e -1.579338 0.950953 -2.017034
f -0.050153 -0.419798 -0.007029
h 1.207687 -1.491949 -0.895676
"""
Comparing this output with the reindexed data frame, we can clearly see that rows b, d, and g, which contained NaN, are dropped.
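By default dropna() removes a row if any value in it is missing. When rows are only partly empty, the how and subset parameters give finer control. This is a minimal sketch with assumed sample values, not a recommendation for any particular dataset.

```python
import numpy as np
import pandas as pd

# Assumed values: row a is complete, row b is empty, rows c and d are partial.
df = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0, np.nan], "two": [4.0, np.nan, np.nan, 7.0]},
    index=["a", "b", "c", "d"],
)

any_missing = df.dropna()               # drop rows with at least one NaN
all_missing = df.dropna(how="all")      # drop rows only if every value is NaN
subset_one = df.dropna(subset=["one"])  # look at column 'one' only

print(any_missing.index.tolist())
print(all_missing.index.tolist())
print(subset_one.index.tolist())
```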
Replacing specific values with a scalar using replace() works much like fillna(), except that it targets the values we name rather than NaN, as shown below:
import pandas as pd
import numpy as np
df = pd.DataFrame({'one':[10,20,30,40,50,2000],
'two':[1000,0,30,40,50,60]})
print(df.replace({1000:10,2000:60}))
Result:
"""
one two
0 10 10
1 20 0
2 30 30
3 40 40
4 50 50
5 60 60
"""
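replace() can also be combined with the missing-value tools above: first mark suspicious values as NaN, then fill them. The sketch below reuses the same data frame and assumes, purely for illustration, that 1000 and 2000 are placeholder values rather than real measurements.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [10, 20, 30, 40, 50, 2000],
                   'two': [1000, 0, 30, 40, 50, 60]})

# Assumption: 1000 and 2000 are placeholders, so mark them as missing.
marked = df.replace([1000, 2000], np.nan)

# Fill each gap with the median of the remaining values in its column.
cleaned = marked.fillna(marked.median())

print(marked.isnull().sum())
print(cleaned)
```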
Hopefully the above examples make these functions clear.
Happy data cleaning, and enjoy learning with Python!