DUPLICATE REMOVAL
In any data set, duplicates are a perennial problem in data cleaning. This article briefly covers several ways to remove duplicates from a Python list.
Method 1: (Traditional loop way)
# Create a list with duplicates
dlist = [10,20,30,40,50,60,10,20,30]
print(dlist)
# remove duplicates
dupFreeList = []
for element in dlist:
    print(element)  # show each element as it is visited
    # keep only the first occurrence of each element
    if element not in dupFreeList:
        dupFreeList.append(element)
print(dupFreeList) # [10, 20, 30, 40, 50, 60]
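The loop above tests membership against a list, so dupFreeList is rescanned on every iteration. As a sketch of one possible variation (not part of the original method), a helper set can remember the values already seen, which makes each check a hash lookup. The names dedupe_keep_order and seen are illustrative, and the elements are assumed to be hashable.

```python
def dedupe_keep_order(items):
    """Return a new list with duplicates removed, keeping first occurrences."""
    seen = set()   # values encountered so far
    result = []    # unique values, in their original order
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


dlist = [10, 20, 30, 40, 50, 60, 10, 20, 30]
print(dedupe_keep_order(dlist))  # [10, 20, 30, 40, 50, 60]
```

For short lists the difference is negligible; the set mainly helps when the list is large.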
Method 2: (List comprehension way)
res = []
[res.append(x) for x in dlist if x not in res]
# print the list after removal
print("The list after removing duplicates : " + str(res))
# The list after removing duplicates : [10, 20, 30, 40, 50, 60]
Method 3:
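You can convert the list to a set and then back to a list to remove duplicates. A set stores each value only once, but it does not keep the original order of the elements.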
dlistset = set(dlist)
print(dlistset)
# {40, 10, 50, 20, 60, 30}
dupFreeList = list(dlistset)
print(dupFreeList) # [40, 10, 50, 20, 60, 30] # Order is not Maintained
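Because a set has no defined order, the result of Method 3 can come back shuffled. If the values only need to be unique and sorted, sorted() can be applied to the set directly; if the order of first appearance matters, one option is to sort the unique values by their first index in the original list. A small sketch, where the list values and the variable names are chosen only for illustration:

```python
values = [30, 10, 20, 10, 30]

# Unique values in ascending order
unique_sorted = sorted(set(values))
print(unique_sorted)  # [10, 20, 30]

# Unique values in order of first appearance.
# values.index(x) returns the position of the first x in values.
unique_in_order = sorted(set(values), key=values.index)
print(unique_in_order)  # [30, 10, 20]
```

Note that values.index rescans the list for every element, so this is best kept for short lists.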
Method 4:
from collections import OrderedDict
dupFreeList = list(OrderedDict.fromkeys(dlist))
print(dupFreeList) # [10, 20, 30, 40, 50, 60] # order is maintained
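Here, we have imported the OrderedDict class from the collections module. OrderedDict.fromkeys(dlist) builds a dictionary whose keys are the list elements in the order they first appear; since dictionary keys are unique, the duplicates are dropped, and list() turns the keys back into a list.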
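To see what happens in between, the intermediate dictionary can be inspected before it is converted to a list. This is only an illustrative sketch; the variable name od is an assumption, not something from the method above.

```python
from collections import OrderedDict

dlist = [10, 20, 30, 40, 50, 60, 10, 20, 30]

# fromkeys() uses each element as a key; every value defaults to None
od = OrderedDict.fromkeys(dlist)
print(len(dlist), len(od))  # 9 6
print(od[10])               # None

# Iterating over a dictionary yields its keys, in insertion order
print(list(od))             # [10, 20, 30, 40, 50, 60]
```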
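Method 5: list(dict.fromkeys(dlist)) usage
From Python 3.7 onward, a plain dict also preserves insertion order, so the same result is possible without importing OrderedDict.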
dlist = ["10","20", "30","40","20","30"] # String
dflist = list(dict.fromkeys(dlist))
print(dlist, dflist)
# ['10', '20', '30', '40', '20', '30'] ['10', '20', '30', '40']
dlist = [10,20,30,40,50,10,20] # integer
dflist = list(dict.fromkeys(dlist))
print(dlist, dflist) #[10, 20, 30, 40, 50, 10, 20] [10, 20, 30, 40, 50]
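One caveat worth knowing: dict.fromkeys() needs hashable elements, so it works for strings and integers as shown above, but a list of lists raises a TypeError. A possible workaround, sketched here with a made-up rows list, is to convert each inner list to a tuple before removing duplicates and back to a list afterwards.

```python
rows = [[1, 2], [3, 4], [1, 2]]  # illustrative data: inner lists are unhashable

try:
    list(dict.fromkeys(rows))
except TypeError as err:
    print("Lists cannot be used as dictionary keys:", err)

# Tuples are hashable, so use them as keys and convert back to lists
unique_rows = [list(t) for t in dict.fromkeys(tuple(r) for r in rows)]
print(unique_rows)  # [[1, 2], [3, 4]]
```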
Happy Open Learning at AMET ODL!