I’m looking for a way to aggregate data by groups that are based on elements in an array. I’m working with product and sales data in a BigQuery warehouse, and want to measure the sales of collections of products, where each product may be in more than one collection.
This is probably best explained with a toy example:
product_id | collections | number_sold ------------------------------------------------------------ 1 |('pantry', 'coffee') | 25 2 |('clothes', 'summer', 'essentials')| 13 3 |('coffee', 'beverages', 'summer') | 9
Where the collections
column above is an array of strings.
I want to get total sales for each collection:
collection | total_sold ----------------------- pantry | 25 coffee | 34 clothes | 13 summer | 22 essentials | 13 beverages | 9
Now I know I could do these one at a time with something like
WHERE 'coffee' IN UNNEST(collections)
or
SUM(CASE WHEN 'coffee' in UNNEST(collections) then number_sold else 0 end) as coffee_sold
But I’m looking for something more elegant that can handle lots and lots of collections and returns a table like the one above. Thank you all!
Advertisement
Answer
Consider below
select collection, sum(number_sold) total_sold from `project.dataset.table` t, t.collections collection group by collection
if applied to sample data in your question (… collections column … is an array of strings.)
with `project.dataset.table` as ( select 1 product_id, ['pantry', 'coffee'] collections, 25 number_sold union all select 2, ['clothes', 'summer', 'essentials'], 13 union all select 3, ['coffee', 'beverages', 'summer'], 9 )
output is