Big Query Deduplication of rows with nested data

If having a table looking like this:

last_updated, uuid, headline, topics
2020-01-01, abc123, "BigQuery is nice", ['big query', 'data warehousing', 'trouble']
2020-01-02, abc123, "BigQuery is nice", ['big query', 'data warehousing', 'trouble']

​x
 
last_updated, uuid, headline, topics2020-01-01, abc123, "BigQuery is nice", ['big query', 'data warehousing', 'trouble']2020-01-02, abc123, "BigQuery is nice", ['big query', 'data warehousing', 'trouble']​

And I only want to return one row, the last updated, how can I do this considering topics includes an array/nested field?

Do I need to unnest and nest again or is there some nifty trick?

Answer

You can use aggregation:

select array_agg(t order by last_updated limit 1)[ordinal(1)].*
from t
group by uid;

 
select array_agg(t order by last_updated limit 1)[ordinal(1)].*from tgroup by uid;​

The above is BigQuery-specific. A more traditional method is to use row_number():

select t.* except (seqnum)
from (select t.*,
             row_number() over (partition by uid order by last_updated desc) as seqnum
      from t
     ) t
where seqnum = 1;

 
select t.* except (seqnum)from (select t.*,             row_number() over (partition by uid order by last_updated desc) as seqnum      from t     ) twhere seqnum = 1;​

Advertisement

Answer