Skip to content
Advertisement

De-duplicate rows in GCP Big Query (SQL) based on two columns [closed]

I’m trying to output all columns while have certain rows de-duplicated. Everything I’ve tried doesn’t seem to get anywhere close.

SELECT * FROM `project.dataset.table`
??
  • de-duplicate based on Name
  • where the chosen row to keep when de-duplicating Column1 uses the most recent date in LastUpdateDate
  • the row to keep will always be the last row (with the duplicate value at Name) as it would have been appended to the table later

Advertisement

Answer

Consider using ARRAY_AGG:

with TestData as (
  select 'Tom' as Name, '1' as Phone, timestamp('2020-01-01 00:00:00') as LastUpdateDate 
  union all
  select 'Tom' as Name, '2' as Phone, timestamp('2020-01-02 00:00:00') as LastUpdateDate
  union all
  select 'Eva' as Name, '3' as Phone, timestamp('2020-01-03 00:00:00') as LastUpdateDate
  union all
  select 'Eva' as Name, '4' as Phone, timestamp('2020-01-04 00:00:00') as LastUpdateDate
)
SELECT deduplicated.* FROM (
  SELECT ARRAY_AGG(t ORDER BY t.LastUpdateDate DESC LIMIT 1)[OFFSET(0)] as deduplicated
  FROM TestData as t 
  GROUP BY Name
)
User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement