I have two tables in my database, each table has column with names. How to compare these tables columns and how to find these names what has exact matches and these names, which are similar in table1 and table2?
For example :
Table 1
AHMAD JAN AKHUNDZADA SHUKOOR AKHUNDZADA alias AHMAD JAN AKHUNZADA alias AHMAD JAN AKHUND ZADA Khatiba Imam Al-Bukhari (KIB) MOHAMMAD SADIQ alias AMIR MOHAMMAD Fuad
Table 2 :
Fuad Khatib Imam Al-Bukhari Khabiba Imam Al - Bukhari ahmad jan Akhunzada shukoor akhundzada AHMAD JAN AKHUNZADA AHMAD JAN AKHUND ZADA AMIR MOHAMMAD MOHAMMAD SADIQ
Result should come out with this:
Table 1 –> Table 2
MOHAMMAD SADIQ alias AMIR MOHAMMAD --> AMIR MOHAMMAD MOHAMMAD SADIQ alias AMIR MOHAMMAD --> MOHAMMAD SADIQ AHMAD JAN AKHUNDZADA SHUKOOR AKHUNDZADA alias AHMAD JAN AKHUNZADA alias AHMAD JAN AKHUND ZADA --> AHMAD JAN AKHUNZADA AHMAD JAN AKHUNDZADA SHUKOOR AKHUNDZADA alias AHMAD JAN AKHUNZADA alias AHMAD JAN AKHUND ZADA --> ahmad jan Akhunzada shukoor akhundzada AHMAD JAN AKHUNDZADA SHUKOOR AKHUNDZADA alias AHMAD JAN AKHUNZADA alias AHMAD JAN AKHUND ZADA --> AHMAD JAN AKHUND Z Khatiba Imam Al-Bukhari (KIB) --> Khatib Imam Al-Bukhari
how to effectively find similar names?
Advertisement
Answer
As I understand, there are some names in both tables that are not exactly the same and vary by some number of characters. For instance, there is a character d
missing in Akhun(d)zada
but present in AKHUNDZADA
.
To find similarity between 2 strings, the Jaccard distance or the Levenshtein distance UDFs can be used. The jaccard()
and levenshtein()
functions are built-in UDFs (community provided) in BigQuery. Other UDFs available in BigQuery can be found here.
Consider the below query for your use case. jaccard()
UDF has been used in this query.
SELECT distinct table_1.Name, table_2.Name as Name_table_2 FROM ( SELECT Name, trim(name_unnest) as name_trim FROM `project-id.dataset-id.table-1`, unnest(split(Name, "alias")) as name_unnest ) as table_1, `project-id.dataset-id.table-2` as table_2 where bqutil.fn.jaccard(lower(table_1.name_trim), lower(table_2.Name))>=0.8 order by Name;
Output of the query:
Also, please note that for the given set of inputs, a threshold of 0.8
works well. The threshold might not hold good for other inputs and has to be adjusted accordingly. The lower the threshold, the more distinct the names will become and lesser is the similarity.