Deduplicating rows in BigQuery based on condition

I’m currently trying to deduplicate rows within Google BigQuery for one of my tables. Basically, I have a table which bas rows containing duplicate values except for one single column. One example would be:

Please notice that all columns, except for the column “Ad group name” are the same. What I would like to do, is the following: keep a single row (wouldn’t matter which one) if all columns are the same but the column ad group name is different.

I was thinking of creating partitions and using a rank function to denotate different values within that partition. Something like:

RANK() OVER (PARTITION BY Adgroup ID, date, Sales, Cost ORDER BY Ad group name) AS rank

Theoretically (hehe) this should lead to something like:

Using this I could filter in a new subquery using WHERE Rank = 1. This would remove all the duplicate rows in that case.

However, I’m finding that BigQuery doesn’t support using FLOAT64 for partitioning, so my solution doesn’t work. Also, I think there’s a better way to do this but I’m having trouble finding out on how to do that. Which function in BigQuery can I leverage for this specifically?

Answer

You can use group by for this purpose:

select Adgroup ID, date, Sales, Cost, any_value(ad_group_name)
from t
group by Adgroup ID, date, Sales, Cost;

Advertisement

Answer