I have a table in BigQuery I would like to aggregate over a number of different categories (columns in the table) to produce a summary of the data, which is a simple COUNT and GROUP BY statement. However, for this project I would like to produce a “complete” count of every permutation of the categories (the possible values of the categories are known ahead of time) with 0 totals for combinations without any entries for that permutation.
Here is the simple initial COUNT and GROUP statement:
SELECT COUNT(item.id) as total, TIMESTAMP_TRUNC(item.date, WEEK) as week, item.attribute_1, item.attribute_2, FROM item GROUP BY week, attribute_1, attribute_2 ORDER BY week DESC, attribute_1, attribute_2
where attribute_1
has possible values ('A', 'B', 'C')
and attribute_2
has possible values ('X', 'Y', 'Z')
which produces something like this:
Row | total | week | attribute_1 | attribute_2 ----------------------------------------------------------------- 1 | 1 | 2020-07-12 00:00:00 UTC | A | X 2 | 2 | 2020-07-12 00:00:00 UTC | A | Z 3 | 1 | 2020-07-12 00:00:00 UTC | B | X 4 | 1 | 2020-07-12 00:00:00 UTC | B | Y 5 | 3 | 2020-07-12 00:00:00 UTC | B | Z 6 | 1 | 2020-07-12 00:00:00 UTC | C | Y 7 | 1 | 2020-07-12 00:00:00 UTC | C | Z etc.
However my goal is to end up with a result like this:
Row | total | week | attribute_1 | attribute_2 ----------------------------------------------------------------- 1 | 1 | 2020-07-12 00:00:00 UTC | A | X 2 | 0 | 2020-07-12 00:00:00 UTC | A | Y 3 | 2 | 2020-07-12 00:00:00 UTC | A | Z 4 | 1 | 2020-07-12 00:00:00 UTC | B | X 5 | 1 | 2020-07-12 00:00:00 UTC | B | Y 6 | 3 | 2020-07-12 00:00:00 UTC | B | Z 7 | 0 | 2020-07-12 00:00:00 UTC | C | X 8 | 1 | 2020-07-12 00:00:00 UTC | C | Y 9 | 1 | 2020-07-12 00:00:00 UTC | C | Z etc.
I found similar problems for a single category, so I can solve for that case by holding the other attributes constant with the following statement:
WITH data AS ( SELECT item.id as item_id, TIMESTAMP_TRUNC(item.date, WEEK) as week, item.attribute_1, item.attribute_2, FROM item GROUP BY week, attribute_1, attribute_2 ORDER BY week DESC, attribute_1, attribute_2 ), category_1 as ( select 'A' as name union all select 'B' union all select 'C' ) SELECT category_1.name, ( SELECT COUNT(item_id) as total FROM data WHERE data.attribute_1 = category_1.name AND data.attribute_2 = 'X' AND data.week > TIMESTAMP('2020-07-10') ) as total FROM category_1
which produces:
Row | name | total ------------------ 1 | A | 1 2 | B | 1 3 | C | 0
Is it possible to solve for the general case I am interested in?
Advertisement
Answer
If I understand correctly, you can generate the rows using a cross join
and then match to the existing values:
SELECT COALESCE(SUM(i.total), 0) as total, w.week, a1.attribute_1, a2.attribute_2 FROM (SELECT distinct TIMESTAMP_TRUNC(date, WEEK) as week FROM item) w CROSS JOIN (SELECT distinct attribute_1 FROM item) a1 CROSS JOIN (SELECT distinct attribute_2 FROM item) a2 LEFT JOIN item i ON TIMESTAMP_TRUNC(i.date, WEEK) = w.week AND i.attribute_1 = a1.attribute_1 AND i.attribute_2 = a2.attribute_2 GROUP BY w.week, a1.attribute_1, a2.attribute_2 ORDER BY week DESC, attribute_1, attribute_2;
Here is a db<>fiddle. It uses Postgres, so there are some syntax tweaks for Postgres.