I’m currently trying to deduplicate rows within Google BigQuery for one of my tables. Basically, I have a table which bas rows containing duplicate values except for one single column. One example would be:
Please notice that all columns, except for the column “Ad group name” are the same. What I would like to do, is the following: keep a single row (wouldn’t matter which one) if all columns are the same but the column ad group name is different.
I was thinking of creating partitions and using a rank function to denotate different values within that partition. Something like:
RANK() OVER (PARTITION BY Adgroup ID, date, Sales, Cost ORDER BY Ad group name) AS rank
Theoretically (hehe) this should lead to something like:
Using this I could filter in a new subquery using WHERE Rank = 1
. This would remove all the duplicate rows in that case.
However, I’m finding that BigQuery doesn’t support using FLOAT64 for partitioning, so my solution doesn’t work. Also, I think there’s a better way to do this but I’m having trouble finding out on how to do that. Which function in BigQuery can I leverage for this specifically?
Advertisement
Answer
You can use group by
for this purpose:
select Adgroup ID, date, Sales, Cost, any_value(ad_group_name) from t group by Adgroup ID, date, Sales, Cost;