I’m looking for a way to aggregate data by groups that are based on elements in an array. I’m working with product and sales data in a BigQuery warehouse, and want to measure the sales of collections of products, where each product may be in more than one collection.
This is probably best explained with a toy example:
x
product_id | collections | number_sold
------------------------------------------------------------
1 |('pantry', 'coffee') | 25
2 |('clothes', 'summer', 'essentials')| 13
3 |('coffee', 'beverages', 'summer') | 9
Where the collections
column above is an array of strings.
I want to get total sales for each collection:
collection | total_sold
-----------------------
pantry | 25
coffee | 34
clothes | 13
summer | 22
essentials | 13
beverages | 9
Now I know I could do these one at a time with something like
WHERE 'coffee' IN UNNEST(collections)
or
SUM(CASE WHEN 'coffee' in UNNEST(collections) then number_sold else 0 end) as coffee_sold
But I’m looking for something more elegant that can handle lots and lots of collections and returns a table like the one above. Thank you all!
Advertisement
Answer
Consider below
select collection, sum(number_sold) total_sold
from `project.dataset.table` t, t.collections collection
group by collection
if applied to sample data in your question (… collections column … is an array of strings.)
with `project.dataset.table` as (
select 1 product_id, ['pantry', 'coffee'] collections, 25 number_sold union all
select 2, ['clothes', 'summer', 'essentials'], 13 union all
select 3, ['coffee', 'beverages', 'summer'], 9
)
output is