Sunday, 3 April 2022

P#19 Duplicates Handling

DUPLICATE REMOVAL

In any data set, duplicates are a perennial problem in data cleaning. In this article, let us briefly look at how to remove duplicates from a Python list.

Method 1: (Traditional loop way)

# Create a list with duplicates

dlist = [10,20,30,40,50,60,10,20,30]
print(dlist)
# remove duplicates
dupFreeList = []
for element in dlist:
    print(element)
    # keep the element only the first time it is seen
    if element not in dupFreeList:
        dupFreeList.append(element)
print(dupFreeList) # [10, 20, 30, 40, 50, 60]
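The loop above tests `element not in dupFreeList`, which scans the result list on every pass. As a minimal sketch, assuming every element is hashable, the same idea can be wrapped in a function that uses a helper set for the membership test while still keeping the original order. The names `remove_duplicates` and `seen` are illustrative and not part of the original code.

```python
def remove_duplicates(items):
    """Return a new list without duplicates, keeping first occurrences in order."""
    seen = set()      # elements already met (assumes hashable elements)
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


dlist = [10, 20, 30, 40, 50, 60, 10, 20, 30]
print(remove_duplicates(dlist))  # [10, 20, 30, 40, 50, 60]
```

For short lists the difference is negligible; the set only starts to matter as the list grows.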

Method 2: (List comprehension way)


res = []
[res.append(x) for x in dlist if x not in res]

# printing list after removal
print ("The list after removing duplicates : " + str(res))
# The list after removing duplicates : [10, 20, 30, 40, 50, 60]

Method 3:

You can convert the list to a set and then back to a list to remove duplicates. A set holds only unique elements, but it does not keep the original order.



dlistset = set(dlist)
print(dlistset)
# {40, 10, 50, 20, 60, 30}
dupFreeList = list(dlistset)
print(dupFreeList) # [40, 10, 50, 20, 60, 30] # Order is not Maintained
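If the set approach is preferred but the original order still matters, one possible workaround (a sketch, not part of the original method) is to sort the unique values by the position of their first occurrence in the source list. The variable name `ordered` is illustrative.

```python
dlist = [10, 20, 30, 40, 50, 60, 10, 20, 30]

# list.index returns the position of the first occurrence,
# so sorting by it restores the first-seen order
ordered = sorted(set(dlist), key=dlist.index)
print(ordered)  # [10, 20, 30, 40, 50, 60]
```

Note that `dlist.index` scans the list for every unique value, so this is best suited to small lists.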


Method 4:


from collections import OrderedDict

dupFreeList = list(OrderedDict.fromkeys(dlist))

print(dupFreeList) # [10, 20, 30, 40, 50, 60] # order is maintained

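Here, we have imported the OrderedDict class from the collections module. OrderedDict.fromkeys(dlist) builds a dictionary whose keys are the elements of dlist in first-seen order (a repeated key is stored only once), and list() turns those keys back into a list.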
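To see what happens in between, the intermediate dictionary can be inspected before it is converted to a list. This is only a sketch; the variable name `keyed` is illustrative.

```python
from collections import OrderedDict

dlist = [10, 20, 30, 40, 50, 60, 10, 20, 30]

# fromkeys stores each element once as a key; every value defaults to None
keyed = OrderedDict.fromkeys(dlist)

print(list(keyed.keys()))    # [10, 20, 30, 40, 50, 60]
print(list(keyed.values()))  # [None, None, None, None, None, None]
```

The values are never used; only the keys, which are unique by definition, are of interest here.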

Method 5: list(dict.fromkeys(dlist)) usage


dlist = ["10","20", "30","40","20","30"] # String
dflist = list(dict.fromkeys(dlist))
print(dlist, dflist)
# ['10', '20', '30', '40', '20', '30'] ['10', '20', '30', '40']


dlist = [10,20,30,40,50,10,20] # integer
dflist = list(dict.fromkeys(dlist))
print(dlist, dflist) #[10, 20, 30, 40, 50, 10, 20] [10, 20, 30, 40, 50]
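The plain dict keeps insertion order in Python 3.7 and later, which is why this method preserves order without OrderedDict. One detail worth knowing: keys are compared by value and type, so the string "10" and the integer 10 are not duplicates of each other. A small sketch with a hypothetical mixed list (`mixed` is not from the original examples):

```python
# a list mixing integers and their string look-alikes
mixed = [10, "10", 20, "20", 10, "10"]

dflist = list(dict.fromkeys(mixed))
print(dflist)  # [10, '10', 20, '20']
```

If such values should count as duplicates, they would need to be converted to one common type before removing duplicates.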

Happy Open Learning at AMET ODL!
