Skip to content
Advertisement

Big Query Deduplication of rows with nested data

If having a table looking like this:

last_updated, uuid, headline, topics
2020-01-01, abc123, "BigQuery is nice", ['big query', 'data warehousing', 'trouble']
2020-01-02, abc123, "BigQuery is nice", ['big query', 'data warehousing', 'trouble']

And I only want to return one row, the last updated, how can I do this considering topics includes an array/nested field?

Do I need to unnest and nest again or is there some nifty trick?

Advertisement

Answer

You can use aggregation:

select array_agg(t order by last_updated limit 1)[ordinal(1)].*
from t
group by uid;

The above is BigQuery-specific. A more traditional method is to use row_number():

select t.* except (seqnum)
from (select t.*,
             row_number() over (partition by uid order by last_updated desc) as seqnum
      from t
     ) t
where seqnum = 1;
User contributions licensed under: CC BY-SA
7 People found this is helpful
Advertisement