
Near-duplicate detection in Python/SQL

I have a huge dataset of shipper/supplier names collected from different sources, and it contains many near-duplicate values. I tried many of the techniques available on the internet, but none of them were quite satisfactory, or they were too slow for data of this size.

I found the fingerprinting algorithms in the OpenRefine GitHub repo, added some code of my own, and that served my purpose. Have a look.
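For readers who want to try the idea, here is a minimal sketch of fingerprint-style key collision clustering in Python. It is not the code from this post (which is not reproduced here) but an illustration of the general approach: normalize each name to a key, then group names that share a key. The function names `fingerprint` and `cluster_names` and the sample supplier names are assumptions made up for the example.

```python
import string
import unicodedata
from collections import defaultdict

# Translation table that deletes ASCII punctuation characters.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def fingerprint(name):
    """Build a normalized key: trim, lowercase, drop punctuation,
    fold accented characters to ASCII, then sort the unique tokens."""
    s = name.strip().lower()
    s = s.translate(_STRIP_PUNCT)
    # NFKD splits accented letters into base letter + combining mark;
    # encoding to ASCII with "ignore" then drops the combining marks.
    s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")
    tokens = sorted(set(s.split()))
    return " ".join(tokens)


def cluster_names(names):
    """Group names whose fingerprints collide (hypothetical helper)."""
    clusters = defaultdict(list)
    for n in names:
        clusters[fingerprint(n)].append(n)
    return dict(clusters)


if __name__ == "__main__":
    # Made-up sample data, not taken from the original dataset.
    sample = ["Acme, Inc.", "acme inc", "Inc. ACME", "Globex Corp"]
    for key, members in cluster_names(sample).items():
        print(key, "->", members)
```

Each name is processed once and grouped with a dictionary lookup, so the work grows roughly linearly with the number of names, which is why this kind of keying tends to cope with large datasets better than pairwise string comparison. It only catches variants that differ in case, punctuation, accents, or word order; misspellings would need a different key or a fuzzy comparison within each cluster.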

My dataset looks something like this:

Supplier Names Uncleaned


Answer

This is what the output data looks like in an Excel file:

Supplier Name Cleaned

User contributions licensed under: CC BY-SA