Is there an elegant way to aggregate data by groups based on elements of an array in SQL?

I’m looking for a way to aggregate data by groups that are based on elements in an array. I’m working with product and sales data in a BigQuery warehouse, and want to measure the sales of collections of products, where each product may be in more than one collection.

This is probably best explained with a toy example:

product_id | collections                       | number_sold
------------------------------------------------------------
1          |('pantry', 'coffee')               | 25
2          |('clothes', 'summer', 'essentials')| 13
3          |('coffee', 'beverages', 'summer')  | 9

​x
 
product_id | collections                       | number_sold------------------------------------------------------------1          |('pantry', 'coffee')               | 252          |('clothes', 'summer', 'essentials')| 133          |('coffee', 'beverages', 'summer')  | 9​

Where the collections column above is an array of strings.

I want to get total sales for each collection:

collection | total_sold
-----------------------
pantry     | 25
coffee     | 34
clothes    | 13
summer     | 22
essentials | 13
beverages  | 9

 
collection | total_sold-----------------------pantry     | 25coffee     | 34clothes    | 13summer     | 22essentials | 13beverages  | 9​

Now I know I could do these one at a time with something like

WHERE 'coffee' IN UNNEST(collections)

 
WHERE 'coffee' IN UNNEST(collections)​

SUM(CASE WHEN 'coffee' in UNNEST(collections) then number_sold else 0 end) as coffee_sold

 
SUM(CASE WHEN 'coffee' in UNNEST(collections) then number_sold else 0 end) as coffee_sold​

But I’m looking for something more elegant that can handle lots and lots of collections and returns a table like the one above. Thank you all!

Answer

Consider below

select collection, sum(number_sold) total_sold
from `project.dataset.table` t, t.collections collection 
group by collection

 
select collection, sum(number_sold) total_soldfrom `project.dataset.table` t, t.collections collection group by collection   ​

if applied to sample data in your question (… collections column … is an array of strings.)

with `project.dataset.table` as (                
  select 1 product_id, ['pantry', 'coffee'] collections, 25 number_sold union all
  select 2, ['clothes', 'summer', 'essentials'], 13 union all
  select 3, ['coffee', 'beverages', 'summer'], 9 
)

 
with `project.dataset.table` as (                  select 1 product_id, ['pantry', 'coffee'] collections, 25 number_sold union all  select 2, ['clothes', 'summer', 'essentials'], 13 union all  select 3, ['coffee', 'beverages', 'summer'], 9 )​

output is

Advertisement

Answer