I am optimizing a query in Bigquery that shows non-repeated data, currently it is like this and it works.
select * from (select ROW_NUMBER() OVER (PARTITION BY id) as num, id, created_at, operator_id, description from NAME_TABLE where created_at >='2018-01-01') where num=1
I wanted to ask if it is possible to make a GROUP BY with all the columns (in a simple way it cannot be done, since crated_at is not possible to group it) and keep the first data of created_at that appears for each id
PD:a DISTINCT does not work, since there are more than 80 million records (they increase 2 million per day) and it returns repeated data
Advertisement
Answer
Below is for BigQuery Standard SQL
#standardSQL SELECT AS VALUE ARRAY_AGG(t ORDER BY created_at LIMIT 1)[OFFSET(0)] FROM `project.dataset.NAME_TABLE` t WHERE created_at >='2018-01-01' GROUP BY id
Instead of processing / returning all columns – you can specify exact list you need as in below example
#standardSQL SELECT AS VALUE ARRAY_AGG(STRUCT(id,created_at,operator_id,description) ORDER BY created_at LIMIT 1)[OFFSET(0)] FROM `project.dataset.NAME_TABLE` WHERE created_at >='2018-01-01' GROUP BY id