Skip to content
Advertisement

Delete repeated data in Bigquery

I am optimizing a query in Bigquery that shows non-repeated data, currently it is like this and it works.

select * from (select ROW_NUMBER() OVER (PARTITION BY id) as num, id, created_at, operator_id, description from NAME_TABLE where created_at >='2018-01-01') where num=1

I wanted to ask if it is possible to make a GROUP BY with all the columns (in a simple way it cannot be done, since crated_at is not possible to group it) and keep the first data of created_at that appears for each id

PD:a DISTINCT does not work, since there are more than 80 million records (they increase 2 million per day) and it returns repeated data

Advertisement

Answer

Below is for BigQuery Standard SQL

#standardSQL
SELECT AS VALUE ARRAY_AGG(t ORDER BY created_at LIMIT 1)[OFFSET(0)]
FROM `project.dataset.NAME_TABLE` t
WHERE created_at >='2018-01-01'
GROUP BY id

Instead of processing / returning all columns – you can specify exact list you need as in below example

#standardSQL
SELECT AS VALUE ARRAY_AGG(STRUCT(id,created_at,operator_id,description) ORDER BY created_at LIMIT 1)[OFFSET(0)]
FROM `project.dataset.NAME_TABLE`
WHERE created_at >='2018-01-01'
GROUP BY id   
User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement