Data Cleaning
A major problem in data cleaning is missing values. They are very common in real-world data.
How can we handle those values in Python? Let us see.
# import the pandas library
import pandas as pd
import numpy as np
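df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
# Reindexing with labels that are not in the original index ('b', 'd', 'g')
# creates new rows filled with NaN
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)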
Result:
        one       two     three
a  0.335319 -0.298568 -2.062935
b       NaN       NaN       NaN
c -1.739043 -0.912386 -0.675446
d       NaN       NaN       NaN
e -0.462957 -1.445715  1.483821
f  0.901405 -1.162616  0.173550
g       NaN       NaN       NaN
h -0.736636  1.685347  1.091092
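In the above data frame, we can see NaN ("Not a Number"), which is how pandas marks a missing value. Rows 'b', 'd' and 'g' are missing because those labels did not exist before the reindex.
Next, let us check which values in a column are missing, using isnull().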
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df['one'].isnull())
Result:
a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
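The True/False mask above can also be summed to see how much is missing at a glance. The following is only a small sketch: the frame and its values are made up for illustration so that the counts are predictable, unlike the random data used above.

```python
import pandas as pd
import numpy as np

# Small frame with known gaps (illustrative values, not from the example above)
df = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0, np.nan],
     "two": [np.nan, 2.0, 3.0, 4.0]},
    index=["a", "b", "c", "d"],
)

# True counts as 1, so summing the mask gives the number of missing values per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Keep only the rows that contain at least one missing value
rows_with_gaps = df[df.isnull().any(axis=1)]
print(rows_with_gaps)
```

Here column 'one' has two missing values and column 'two' has one, and rows 'a', 'b' and 'd' are the ones with gaps.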
OK. Now we know the problem. How do we rectify it and clean the data?
One option is to replace NaN with 0 using fillna().
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],columns=['one',
'two', 'three'])
df = df.reindex(['a', 'b', 'c'])
print(df)
print("C..NaN replaced with '0':")
print( df.fillna(0))
Result:
        one       two     three
a  0.373935 -1.487100 -0.272034
b       NaN       NaN       NaN
c  0.686059  0.286542 -0.093683
C..NaN replaced with '0':
        one       two     three
a  0.373935 -1.487100 -0.272034
b  0.000000  0.000000  0.000000
c  0.686059  0.286542 -0.093683
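Zero is not always a sensible stand-in, since it can distort averages. As an alternative that this post does not otherwise cover, a common choice is to fill each column with its own mean. A minimal sketch, with made-up values so the result is easy to check by hand:

```python
import pandas as pd
import numpy as np

# Illustrative values only
df = pd.DataFrame({"one": [1.0, np.nan, 3.0],
                   "two": [10.0, 20.0, np.nan]})

# df.mean() ignores NaN by default and returns one mean per column;
# fillna() then matches those means to the columns by name
filled = df.fillna(df.mean())
print(filled)
```

The gap in 'one' becomes 2.0 (the mean of 1 and 3) and the gap in 'two' becomes 15.0 (the mean of 10 and 20).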
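Now we fill each missing value with the last valid value above it. pandas calls this 'pad' or forward fill. Recent pandas versions deprecate fillna(method='pad') in favor of ffill(), which gives the same result, so the script below uses ffill().
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# Forward fill: same result as the older df.fillna(method='pad')
print(df.ffill())
Result: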
"""
        one       two     three
a -1.764189  1.336129  0.512163
b -1.764189  1.336129  0.512163
c  1.495126 -0.165035 -1.719821
d  1.495126 -0.165035 -1.719821
e  1.273926  0.606101  1.416004
f  1.901047  1.813446 -0.263735
g  1.901047  1.813446 -0.263735
h -1.900605  0.052075 -2.418204
"""
Another option is to drop the rows that contain missing values, using dropna().
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df.dropna())
"""
        one       two     three
a  1.177113 -0.471903 -0.779807
c -0.917548 -0.478030  0.128027
e -1.579338  0.950953 -2.017034
f -0.050153 -0.419798 -0.007029
h  1.207687 -1.491949 -0.895676
"""
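Finally, replace() swaps specific values for others. Here 1000 is replaced with 10 and 2000 with 60.
import pandas as pd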
import numpy as np
df = pd.DataFrame({'one':[10,20,30,40,50,2000],
'two':[1000,0,30,40,50,60]})
print(df.replace({1000:10,2000:60}))
"""
   one  two
0   10   10
1   20    0
2   30   30
3   40   40
4   50   50
5   60   60
"""