Skip to content
Advertisement

Is there an elegant way to aggregate data by groups based on elements of an array in SQL?

I’m looking for a way to aggregate data by groups that are based on elements in an array. I’m working with product and sales data in a BigQuery warehouse, and want to measure the sales of collections of products, where each product may be in more than one collection.

This is probably best explained with a toy example:

product_id | collections                       | number_sold
------------------------------------------------------------
1          |('pantry', 'coffee')               | 25
2          |('clothes', 'summer', 'essentials')| 13
3          |('coffee', 'beverages', 'summer')  | 9

Where the collections column above is an array of strings.

I want to get total sales for each collection:

collection | total_sold
-----------------------
pantry     | 25
coffee     | 34
clothes    | 13
summer     | 22
essentials | 13
beverages  | 9

Now I know I could do these one at a time with something like

WHERE 'coffee' IN UNNEST(collections)

or

SUM(CASE WHEN 'coffee' in UNNEST(collections) then number_sold else 0 end) as coffee_sold

But I’m looking for something more elegant that can handle lots and lots of collections and returns a table like the one above. Thank you all!

Advertisement

Answer

Consider below

select collection, sum(number_sold) total_sold
from `project.dataset.table` t, t.collections collection 
group by collection   

if applied to sample data in your question (… collections column … is an array of strings.)

with `project.dataset.table` as (                
  select 1 product_id, ['pantry', 'coffee'] collections, 25 number_sold union all
  select 2, ['clothes', 'summer', 'essentials'], 13 union all
  select 3, ['coffee', 'beverages', 'summer'], 9 
)

output is

enter image description here

User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement