company | email | phone | website | address Amar CO LLC | amar@gmail.com | 123 | NULL | India Amar CO | amar@gmail.com | NULL | NULL | IND Stacks CO | stack@gmil.com | 910 | stacks.com | United Kingdom Stacks CO LLC | stack@gmail.com | NULL | NULL | UK
I want to drop the company name with CO LLC
instead want to keep Amar CO
but want all the columns from Amar CO LLC
as it has minimum NULL
values or maximum column data.
In short: De-dupe the records, remove the company name with ‘ending with or matching with LLC’ (case insensitive), but keep the values from both of the record which has maximum Information column.
Expected output
Amar CO | amar@gmail.com | 123 | NULL | India Stacks CO | stack@gmil.com | 910 | stacks.com | United Kingdom
Advertisement
Answer
to give precedence to the record having minimum null values …
Below is for BigQuery Standard SQL (query#1)
#standardSQL select array_agg(t order by array_length(regexp_extract_all(to_json_string(t), ':null')) limit 1 )[offset(0)].* replace(regexp_replace(company, r'(?i)CO LLC', 'CO') as company) from `project.dataset.table` t group by company
if applied to sample data from your question – output is
In case if you want to fill all fields from all the records – you can use below (query#2)
select regexp_replace(company, r'(?i)CO LLC', 'CO') as company, max(email) email, max(phone) phone, max(website) website, max(address) address from `project.dataset.table` group by company
and finally – if you still want to give precedence to the record having minimum null values, but the rest of nulls replace with values from other rows – use below (query#3)
select company, ifnull(email, max_email) email, ifnull(phone, max_phone) phone, ifnull(website, max_website) website, ifnull(address, max_address) address from ( select array_agg(t order by array_length(regexp_extract_all(to_json_string(t), ':null')) limit 1 )[offset(0)].* replace(regexp_replace(company, r'(?i)CO LLC', 'CO') as company), max(email) max_email, max(phone) max_phone, max(website) max_website, max(address) max_address from `project.dataset.table` t group by company )
you can test/check the difference between this and previous option by applying them to below dummy data
with `project.dataset.table` as ( select 'Amar CO LLC' company, 'amar@gmail.com' email, 123 phone, NULL website, 'India' address union all select 'Amar CO', NULL, 222, 'amar.com', NULL union all select 'Stacks CO LLC', 'stack@gmail.com', NULL, NULL, 'UK' union all select 'Stacks CO', 'stack@gmil.com', 910, 'stacks.com', 'United Kingdom' )
the last query (query#3) gives
while previous (query#2) will just give max across all rows