Skip to content
Advertisement

Deduplicate table SQL with nested rows (type STRUCT)

I have a SQL table (in BigQuery) with possible duplicated rows. The table has over 20 columns, some of them are nested (data type “STRUCT)”. I want to deduplicate the table.

I can’t simply query SELECT DISTINCT * because I get an error

Query error: Column options of type STRUCT cannot be used in SELECT DISTINCT

So far, I tried to create a unique ID based on a hash of certain columns. I have now this unique ID (called sha256), but I can’t figure out a way of selecting only rows with unique hash.

I tried to GROUP BY, but it doesn’t work with STRUCT type, and I tried also to INNER JOIN with a table containing only unique hashed, but I get duplicates also.

For reference, here are 2 example rows of the dataset:

and the query I’m working on so far:

A solution would be to find a way of inner join the table_hashed and the table_unique_hash on the sha256 column…

Thanks for your help!

Advertisement

Answer

I found a workaround based on this topic. The combination of GROUP BY and ANY function for all the STRUCT columns made it work!

Note: my “price” field was previously an array; I transformed it in my source json to an int

User contributions licensed under: CC BY-SA
4 People found this is helpful
Advertisement