I have a table in BigQuery I would like to aggregate over a number of different categories (columns in the table) to produce a summary of the data, which is a simple COUNT and GROUP BY statement. However, for this project I would like to produce a “complete” count of every permutation of the categories (the possible values of the categories are known ahead of time) with 0 totals for combinations without any entries for that permutation.
Here is the simple initial COUNT and GROUP statement:
SELECT
COUNT(item.id) as total,
TIMESTAMP_TRUNC(item.date, WEEK) as week,
item.attribute_1,
item.attribute_2,
FROM item
GROUP BY week, attribute_1, attribute_2
ORDER BY week DESC, attribute_1, attribute_2
where attribute_1
has possible values ('A', 'B', 'C')
and attribute_2
has possible values ('X', 'Y', 'Z')
which produces something like this:
Row | total | week | attribute_1 | attribute_2
-----------------------------------------------------------------
1 | 1 | 2020-07-12 00:00:00 UTC | A | X
2 | 2 | 2020-07-12 00:00:00 UTC | A | Z
3 | 1 | 2020-07-12 00:00:00 UTC | B | X
4 | 1 | 2020-07-12 00:00:00 UTC | B | Y
5 | 3 | 2020-07-12 00:00:00 UTC | B | Z
6 | 1 | 2020-07-12 00:00:00 UTC | C | Y
7 | 1 | 2020-07-12 00:00:00 UTC | C | Z
etc.
However my goal is to end up with a result like this:
Row | total | week | attribute_1 | attribute_2
-----------------------------------------------------------------
1 | 1 | 2020-07-12 00:00:00 UTC | A | X
2 | 0 | 2020-07-12 00:00:00 UTC | A | Y
3 | 2 | 2020-07-12 00:00:00 UTC | A | Z
4 | 1 | 2020-07-12 00:00:00 UTC | B | X
5 | 1 | 2020-07-12 00:00:00 UTC | B | Y
6 | 3 | 2020-07-12 00:00:00 UTC | B | Z
7 | 0 | 2020-07-12 00:00:00 UTC | C | X
8 | 1 | 2020-07-12 00:00:00 UTC | C | Y
9 | 1 | 2020-07-12 00:00:00 UTC | C | Z
etc.
I found similar problems for a single category, so I can solve for that case by holding the other attributes constant with the following statement:
WITH data AS (
SELECT
item.id as item_id,
TIMESTAMP_TRUNC(item.date, WEEK) as week,
item.attribute_1,
item.attribute_2,
FROM item
GROUP BY week, attribute_1, attribute_2
ORDER BY week DESC, attribute_1, attribute_2
),
category_1 as (
select 'A' as name union all
select 'B' union all
select 'C'
)
SELECT
category_1.name,
(
SELECT COUNT(item_id) as total
FROM data
WHERE data.attribute_1 = category_1.name
AND data.attribute_2 = 'X'
AND data.week > TIMESTAMP('2020-07-10')
) as total
FROM category_1
which produces:
Row | name | total
------------------
1 | A | 1
2 | B | 1
3 | C | 0
Is it possible to solve for the general case I am interested in?
Advertisement
Answer
If I understand correctly, you can generate the rows using a cross join
and then match to the existing values:
SELECT COALESCE(SUM(i.total), 0) as total,
w.week, a1.attribute_1, a2.attribute_2
FROM (SELECT distinct TIMESTAMP_TRUNC(date, WEEK) as week FROM item) w CROSS JOIN
(SELECT distinct attribute_1 FROM item) a1 CROSS JOIN
(SELECT distinct attribute_2 FROM item) a2 LEFT JOIN
item i
ON TIMESTAMP_TRUNC(i.date, WEEK) = w.week AND
i.attribute_1 = a1.attribute_1 AND
i.attribute_2 = a2.attribute_2
GROUP BY w.week, a1.attribute_1, a2.attribute_2
ORDER BY week DESC, attribute_1, attribute_2;
Here is a db<>fiddle. It uses Postgres, so there are some syntax tweaks for Postgres.