Example
with lhs_table as ( select 'Nia' as firstName, 'Johnson Jr.' as lastName, 'FEMALE' as gender, 'UNLV' as school, 'Mountain West Conference' as conference union all select 'Jana' as firstName, 'Abdullah' as lastName, 'FEMALE' as gender, 'Cincinnati' as school, 'American Athletic Conference' as conference union all select 'Kay' as firstName, 'Sieper' as lastName, 'FEMALE' as gender, 'Loyola Maryland' as school, 'Patriot League' as conference union all select 'Alessia' as firstName, 'Capley' as lastName, 'FEMALE' as gender, 'Presbyterian' as school, 'Big South Conference' as conference union all select 'Aaliyah' as firstName, 'Stanley' as lastName, 'FEMALE' as gender, 'FGCU' as school, 'ASUN Conference' as conference ), rhs_table as ( select 1611707 as playerId, 'Kayla' as firstName, 'Sieper' as lastName, 'Loyola Maryland' as teamMarket, 'Patriot League' as conferenceName union all select 1380430 as playerId, 'Jana' as firstName, 'Abdullah' as lastName, 'Cincinnati' as teamMarket, 'American Athletic Conference' as conferenceName union all select 1234567 as playerId, 'Mela' as firstName, 'Aravada' as lastName, 'Cincinnati' as teamMarket, 'American Athletic Conference' as conferenceName union all select 1354105 as playerId, 'Aaliyah' as firstName, 'Stanley' as lastName, 'FGCU' as teamMarket, 'ASUN Conference' as conferenceName union all select 1138439 as playerId, 'Aaliyah' as firstName, 'Stanley' as lastName, 'Emory' as teamMarket, 'ASUN Conference' as conferenceName union all select 996101 as playerId, 'Nia' as firstName, 'Johnson' as lastName, 'Emmanuel (GA)' as teamMarket, 'Conference Carolinas' as conferenceName union all select 977605 as playerId, 'Nia' as firstName, 'Johnson' as lastName, 'UNLV' as teamMarket, 'Mountain West Conference' as conferenceName union all select 1329967 as playerId, 'Alessia' as firstName, 'Capley' as lastName, 'Presbyterian' as teamMarket, 'Big South Conference' as conferenceName union all select 995234 as playerId, 'Nia' as firstName, 'Johnson' as lastName, 'Delta St.' as teamMarket, 'Gulf South Conference' as conferenceName union all select 4567890 as playerId, 'Britney' as firstName, 'Capley' as lastName, 'Presbyterian' as teamMarket, 'Big South Conference' as conferenceName ) select b.playerId ,a.* from lhs_table as a left join rhs_table as b on a.firstName = b.firstName and a.lastName = b.lastName and a.school = b.teamMarket and a.conference = b.conferenceName
We are looking to join rhs_table
onto lhs_table
for the playerIds. Every person in lhs_table
has a corresponding row in rhs_table
, however the joins are not so simple:
- For Nia Johnson Jr.,
Jr.
is missing inrhs_table
- For Kay Sieper, her full name
Kayla
is used inrhs_table
- We want to ignore (ie not left join) RHS players on the wrong team (Nia on Emmanuel, Aaliyah on Emory).
Because of these mismatches, we need to fuzzy match
instead. We have tried replacing on a.firstName = b.firstName
with on a.firstName like b.firstName
. Note that the conferences are the 1 column that do match exactly between tables, and also if it helps we can manually ensure that the teams match, although it would take some time. The important part is handling names not spelt the same.
The 5 correct playerIds, in order, are 977605, 1380430, 1611707, 1329967, 1354105
. Can we somehow fuzzy match to get these playerIds?
Advertisement
Answer
Consider below approach
select array_agg(b.playerId order by d limit 1) playerId, any_value(a).* from ( select `bqutil.fn.levenshtein`(a.firstName, b.firstName) / greatest(length(a.firstName), length(b.firstName)) + `bqutil.fn.levenshtein`(a.lastName, b.lastName) / greatest(length(a.lastName), length(b.lastName)) + `bqutil.fn.levenshtein`(a.school, b.teamMarket) / greatest(length(a.school), length(b.teamMarket)) + `bqutil.fn.levenshtein`(a.conference, b.conferenceName) / greatest(length(a.conference), length(b.conferenceName)) d, a, b from lhs_table as a cross join rhs_table as b ) group by to_json_string(a)
if applied to sample data in your question – output is
You can play/experiment with variation of above, like below as an example
select array_agg(b.playerId order by d limit 1) playerId, any_value(a).* from ( select `bqutil.fn.levenshtein`(format('%t', a), format('%t', b)) d, a, b from lhs_table as a cross join rhs_table as b ) group by to_json_string(a)