Skip to content
Advertisement

I want to de-dupe records in BigQuery with max column value on specific column with expression

company  | email | phone | website | address
Amar CO LLC | amar@gmail.com | 123 | NULL | India
Amar CO | amar@gmail.com | NULL | NULL | IND
Stacks CO | stack@gmil.com | 910 | stacks.com | United Kingdom
Stacks CO LLC | stack@gmail.com | NULL | NULL | UK

I want to drop the company name with CO LLC instead want to keep Amar CO but want all the columns from Amar CO LLC as it has minimum NULL values or maximum column data.

In short: De-dupe the records, remove the company name with ‘ending with or matching with LLC’ (case insensitive), but keep the values from both of the record which has maximum Information column.

Expected output

Amar CO | amar@gmail.com | 123 | NULL | India
Stacks CO | stack@gmil.com | 910 | stacks.com | United Kingdom

Advertisement

Answer

to give precedence to the record having minimum null values …

Below is for BigQuery Standard SQL (query#1)

#standardSQL
select 
  array_agg(t 
    order by array_length(regexp_extract_all(to_json_string(t), ':null')) 
    limit 1
  )[offset(0)].* 
  replace(regexp_replace(company, r'(?i)CO LLC', 'CO') as company) 
from `project.dataset.table` t
group by company 

if applied to sample data from your question – output is

enter image description here

In case if you want to fill all fields from all the records – you can use below (query#2)

select regexp_replace(company, r'(?i)CO LLC', 'CO') as company,
  max(email) email,
  max(phone) phone,
  max(website) website,
  max(address) address
from `project.dataset.table`
group by company 

and finally – if you still want to give precedence to the record having minimum null values, but the rest of nulls replace with values from other rows – use below (query#3)

select company, 
  ifnull(email, max_email) email,
  ifnull(phone, max_phone) phone,
  ifnull(website, max_website) website,
  ifnull(address, max_address) address
from (
  select array_agg(t 
      order by array_length(regexp_extract_all(to_json_string(t), ':null')) 
      limit 1
    )[offset(0)].* 
    replace(regexp_replace(company, r'(?i)CO LLC', 'CO') as company),
    max(email) max_email, 
    max(phone) max_phone,
    max(website) max_website,
    max(address) max_address
  from `project.dataset.table` t
  group by company 
)  

you can test/check the difference between this and previous option by applying them to below dummy data

with `project.dataset.table` as (
  select 'Amar CO LLC' company, 'amar@gmail.com' email, 123 phone, NULL website, 'India' address union all
  select 'Amar CO', NULL, 222, 'amar.com', NULL union all
  select 'Stacks CO LLC', 'stack@gmail.com', NULL, NULL, 'UK' union all
  select 'Stacks CO', 'stack@gmil.com', 910, 'stacks.com', 'United Kingdom'
)

the last query (query#3) gives

enter image description here

while previous (query#2) will just give max across all rows

enter image description here

User contributions licensed under: CC BY-SA
9 People found this is helpful
Advertisement