I want to de-dupe records in BigQuery with max column value on specific column with expression

Question

company | email | phone | website | address
Amar CO LLC | amar@gmail.com | 123 | NULL | India
Amar CO | amar@gmail.com | NULL | NULL | IND
Stacks CO | stack@gmil.com | 910 | stacks.com | United ...

Accepted Answer

To give precedence to the record having the minimum number of null values, use the query below. It is for BigQuery Standard SQL (query#1):

    #standardSQL
    select
      -- per group, keep the single row whose JSON form has the fewest ':null' tokens
      array_agg(t
        order by array_length(regexp_extract_all(to_json_string(t), ':null'))
        limit 1
      )[offset(0)].*
      -- normalize the company name, e.g. 'Amar CO LLC' becomes 'Amar CO'
      replace(regexp_replace(company, r'(?i)CO LLC', 'CO') as company)
    from `project.dataset.table` t
    group by company

When applied to the sample data from your question, the output is:

[result table not preserved in the source]

If you want to fill all fields from all the records, you can use the following (query#2):

    select regexp_replace(company, r'(?i)CO LLC', 'CO') as company,
      max(email) email,
      max(phone) phone,
      max(website) website,
      max(address) address
    from `project.dataset.table`
    group by company

And finally, if you still want to give precedence to the record having the minimum number of null values, but replace its remaining nulls with values from other rows, use the following (query#3):

    select company,
      -- fall back to the group-wide max only where the chosen row is null
      ifnull(email, max_email) email,
      ifnull(phone, max_phone) phone,
      ifnull(website, max_website) website,
      ifnull(address, max_address) address
    from (
      select array_agg(t
          order by array_length(regexp_extract_all(to_json_string(t), ':null'))
          limit 1
        )[offset(0)].*
        replace(regexp_replace(company, r'(?i)CO LLC', 'CO') as company),
        max(email) max_email,
        max(phone) max_phone,
        max(website) max_website,
        max(address) max_address
      from `project.dataset.table` t
      group by company
    )

You can check the difference between this and the previous option by applying both to the dummy data below:

    with `project.dataset.table` as (
      select 'Amar CO LLC' company, 'amar@gmail.com' email, 123 phone, NULL website, 'India' address union all
      select 'Amar CO', NULL, 222, 'amar.com', NULL union all
      select 'Stacks CO LLC', 'stack@gmail.com', NULL, NULL, 'UK' union all
      select 'Stacks CO', 'stack@gmil.com', 910, 'stacks.com', 'United Kingdom'
    )

The last query (query#3) gives:

[result table not preserved in the source]

while the previous one (query#2) just gives the max across all rows.
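To make the difference between query#2 and query#3 concrete without running BigQuery, here is a minimal plain-Python sketch of the two strategies applied to the dummy rows from the answer. It is an emulation, not BigQuery itself. The function names (`normalize`, `max_per_column`, `fewest_nulls_then_fill`) are hypothetical, `None` stands in for SQL NULL, and the sketch assumes rows are grouped by the normalized company name, which is what the answer appears to intend.

```python
import re
from collections import defaultdict

# Dummy rows copied from the answer; None stands in for SQL NULL.
ROWS = [
    {"company": "Amar CO LLC", "email": "amar@gmail.com", "phone": 123,
     "website": None, "address": "India"},
    {"company": "Amar CO", "email": None, "phone": 222,
     "website": "amar.com", "address": None},
    {"company": "Stacks CO LLC", "email": "stack@gmail.com", "phone": None,
     "website": None, "address": "UK"},
    {"company": "Stacks CO", "email": "stack@gmil.com", "phone": 910,
     "website": "stacks.com", "address": "United Kingdom"},
]
FIELDS = ["email", "phone", "website", "address"]


def normalize(company):
    # Same substitution as regexp_replace(company, r'(?i)CO LLC', 'CO').
    return re.sub(r"(?i)CO LLC", "CO", company)


def group_rows(rows):
    # Assumption: grouping happens on the normalized company name.
    groups = defaultdict(list)
    for row in rows:
        groups[normalize(row["company"])].append(row)
    return groups


def null_count(row):
    # Counts None values directly instead of matching ':null' in a JSON string.
    return sum(value is None for value in row.values())


def column_max(rows, field):
    # Like SQL max(): NULLs are ignored; an all-NULL column yields None.
    values = [row[field] for row in rows if row[field] is not None]
    return max(values) if values else None


def max_per_column(rows):
    # query#2 style: every field is the max over the whole group.
    return {
        company: {f: column_max(members, f) for f in FIELDS}
        for company, members in group_rows(rows).items()
    }


def fewest_nulls_then_fill(rows):
    # query#3 style: start from the row with the fewest NULLs, then fill
    # its remaining NULLs with the group-wide max of that column.
    # On ties, min() keeps the first row seen; this sketch does not try
    # to reproduce how BigQuery would break such a tie.
    result = {}
    for company, members in group_rows(rows).items():
        best = min(members, key=null_count)
        result[company] = {
            f: best[f] if best[f] is not None else column_max(members, f)
            for f in FIELDS
        }
    return result


if __name__ == "__main__":
    for name, strategy in [("query#2", max_per_column),
                           ("query#3", fewest_nulls_then_fill)]:
        for company, fields in strategy(ROWS).items():
            print(name, company, fields)
```

In this sketch the two strategies agree on Stacks CO and differ only in the phone for Amar CO: the max-per-column approach returns 222, while the fewest-nulls approach keeps 123 from the more complete row and fills only the missing website from the other row.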

Answer