Fuzzy matching multiple columns in BigQuery when left-joining

Example

[Image: sample data for lhs_table and rhs_table]

We want to left join rhs_table onto lhs_table to get the playerIds. Every person in lhs_table has a corresponding row in rhs_table; however, the join is not straightforward:

  • For Nia Johnson Jr., the "Jr." suffix is missing in rhs_table.
  • For Kay Sieper, her full first name, Kayla, is used in rhs_table.
  • We want to ignore (i.e., not left join) RHS players on the wrong team (Nia on Emmanuel, Aaliyah on Emory).

Because of these mismatches, we need to fuzzy match instead of joining on exact equality. We have tried replacing on a.firstName = b.firstName with on a.firstName like b.firstName. Note that the conference is the one column that does match exactly between the tables. If it helps, we can also manually ensure that the teams match, although that would take some time. The important part is handling names that are not spelled the same.
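As a quick illustration of why the LIKE attempt changes nothing: without % or _ wildcards, LIKE is effectively an equality test. The literals below come from the Kay/Kayla case. This is only a sketch to show the behavior, not part of any final query.

```sql
-- Standalone check in BigQuery Standard SQL; no tables needed.
SELECT
  'Kay' LIKE 'Kayla'               AS plain_like,   -- false: no wildcard, so this acts like =
  'Kayla' LIKE CONCAT('Kay', '%')  AS prefix_like,  -- true: % matches the trailing "la"
  STARTS_WITH('Kayla', 'Kay')      AS starts_with;  -- true: same prefix test without LIKE
```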

The 5 correct playerIds, in order, are 977605, 1380430, 1611707, 1329967, 1354105. Can we somehow fuzzy match to get these playerIds?

Answer

Consider the approach below.

If applied to the sample data in your question, the output is:

[Image: query output]

You can also experiment with variations of the above.
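One possible variation, offered as a minimal sketch rather than the query from this answer: normalize the names first, then join on exact conference and team plus a prefix match on first names. Several things here are assumptions. Only firstName and playerId are named in the question; the columns lastName, team and conference are hypothetical. The sketch also assumes that the "Jr." suffix sits at the end of lastName, that lhs_table has no playerId column of its own, and that team names have already been aligned between the two tables.

```sql
-- Sketch only. lastName, team and conference are assumed column names.
WITH lhs AS (
  SELECT
    *,
    LOWER(TRIM(firstName)) AS firstNameNorm,
    -- Lower-case and drop a trailing generational suffix such as "Jr."
    TRIM(REGEXP_REPLACE(LOWER(lastName), r'\s+(jr|sr|ii|iii|iv)\.?$', '')) AS lastNameNorm
  FROM lhs_table
),
rhs AS (
  SELECT
    *,
    LOWER(TRIM(firstName)) AS firstNameNorm,
    TRIM(REGEXP_REPLACE(LOWER(lastName), r'\s+(jr|sr|ii|iii|iv)\.?$', '')) AS lastNameNorm
  FROM rhs_table
)
SELECT
  a.* EXCEPT (firstNameNorm, lastNameNorm),
  b.playerId
FROM lhs AS a
LEFT JOIN rhs AS b
  ON  a.conference = b.conference        -- stated in the question to match exactly
  AND a.team = b.team                    -- keeps out players on the wrong team
  AND a.lastNameNorm = b.lastNameNorm    -- "Johnson Jr." and "Johnson" now agree
  AND (
    -- "kay" is a prefix of "kayla", in either direction
    STARTS_WITH(a.firstNameNorm, b.firstNameNorm)
    OR STARTS_WITH(b.firstNameNorm, a.firstNameNorm)
  );
```

Prefix matching covers shortened first names such as Kay versus Kayla, but not arbitrary misspellings. For those, BigQuery's EDIT_DISTANCE function could replace the STARTS_WITH test, at the cost of having to choose a distance threshold. If more than one RHS row satisfies the conditions, the join returns duplicates, so it is worth checking the row count against lhs_table.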

User contributions licensed under: CC BY-SA