I need to count the number of products that existed in inventory by date. In the database however, a product is only recorded when it was viewed by a consumer.
For example consider this basic table structure:
date | productId | views July 1 | A | 8 July 2 | A | 6 July 2 | B | 4 July 3 | A | 2 July 4 | A | 8 July 4 | B | 6 July 4 | C | 4 July 5 | C | 2 July 10 | A | 17
Using the following query, I attempt to determine the amount of products in inventory on a given date.
select date, count(distinct productId) as Inventory, sum(views) as views from ( select date, productId, count(*) as views from SomeTable group by date, productID order by date asc, productID asc ) group by date
This is the output
date | Inventory | views July 1 | 1 | 8 July 2 | 2 | 10 July 3 | 1 | 2 July 4 | 3 | 18 July 5 | 1 | 2 July 10 | 1 | 17
My output is not an accurate reflection of how many products were in inventory due to missing rows.
The correct understanding of inventory is as follows:
– Product A was present in inventory from July 1 – July 10.
– Product B was present in inventory from July 2 – July 4.
– Product C was in inventory from July 4 – July 5.
The correct SQL output should be:
date | Inventory | views July 1 | 1 | 8 July 2 | 2 | 10 July 3 | 2 | 2 July 4 | 3 | 18 July 5 | 2 | 2 July 6 | 1 | 0 July 7 | 1 | 0 July 8 | 1 | 0 July 9 | 1 | 0 July 10 | 1 | 17
If you are following along, let me confirm that I am comfortable defining “in inventory” as the date difference between the first & last view.
I have followed the following faulty process:
First I created a table which was the cartesian product of every productID & every date. ”’ with Dates as ( select date from SomeTable group by date ), Products as ( select productId from SomeTable group by productId ) select Dates.date, Products.productId from Dates cross join Products ”’
Then I attempted do a right outer join to reduce this to just the missing records:
with Records as ( select date, productId, count(*) as views from SomeTable group by date, productId ), Cartesian as ( {See query above} ) Select Cartesian.date, Cartesian.productId, 0 as views #for upcoming union from Cartesian right outer join Records on Cartesian.date = Records.date where Records.productId is null
Then with the missing rows in hand, union them back onto the Records. in doing so, I create a new problem: extra rows.
date | productId | views July 1 | A | 8 July 1 | B | 0 July 1 | C | 0 July 2 | A | 6 July 2 | B | 4 July 2 | C | 0 July 3 | A | 2 July 3 | B | 0 July 3 | C | 0 July 4 | A | 8 July 4 | B | 6 July 4 | C | 4 July 5 | A | 2 July 5 | B | 0 July 5 | C | 0 July 6 | A | 0 July 6 | B | 0 July 6 | C | 0 July 7 | A | 0 July 7 | B | 0 July 7 | C | 0 July 8 | A | 0 July 8 | B | 0 July 8 | C | 0 July 9 | A | 0 July 9 | B | 0 July 9 | C | 0 July 10 | A | 17 July 10 | B | 0 July 10 | C | 0
And when I run my simple query
select date, count(distinct productId) as Inventory, sum(views) as views
on that table I get the wrong output again:
date | Inventory | views July 1 | 3 | 8 July 2 | 3 | 10 July 3 | 3 | 2 July 4 | 3 | 18 July 5 | 3 | 2 July 6 | 3 | 0 July 7 | 3 | 0 July 8 | 3 | 0 July 9 | 3 | 0 July 10 | 3 | 17
My next thought would be to iterate through each productId, determine it’s first & last date, then Union that with the Cartesian table with the condition that the Cartesian.date falls between the first & last date for each specific product.
There’s got to be an easier way to do this. Thanks.
Advertisement
Answer
Below is for BigQuery Standard SQL
#standardSQL WITH dates AS ( SELECT day FROM ( SELECT MIN(day) min_day, MAX(day) max_day FROM `project.dataset.table` ), UNNEST(GENERATE_DATE_ARRAY(min_day, max_day, INTERVAL 1 DAY)) day ), ranges AS ( SELECT productId, MIN(day) min_day, MAX(day) max_day FROM `project.dataset.table` t GROUP BY productId ) SELECT day, COUNT(DISTINCT productId) Inventory, SUM(IFNULL(views, 0)) views FROM dates d, ranges r LEFT JOIN `project.dataset.table` USING(day, productId) WHERE day BETWEEN min_day AND max_day GROUP BY day
If to apply to sample data from your question as in below example
#standardSQL WITH `project.dataset.table` AS ( SELECT DATE '2019-07-01' day, 'A' productId, 8 views UNION ALL SELECT '2019-07-02', 'A', 6 UNION ALL SELECT '2019-07-02', 'B', 4 UNION ALL SELECT '2019-07-03', 'A', 2 UNION ALL SELECT '2019-07-04', 'A', 8 UNION ALL SELECT '2019-07-04', 'B', 6 UNION ALL SELECT '2019-07-04', 'C', 4 UNION ALL SELECT '2019-07-05', 'C', 2 UNION ALL SELECT '2019-07-10', 'A', 17 ), dates AS ( SELECT day FROM ( SELECT MIN(day) min_day, MAX(day) max_day FROM `project.dataset.table` ), UNNEST(GENERATE_DATE_ARRAY(min_day, max_day, INTERVAL 1 DAY)) day ), ranges AS ( SELECT productId, MIN(day) min_day, MAX(day) max_day FROM `project.dataset.table` t GROUP BY productId ) SELECT day, COUNT(DISTINCT productId) Inventory, SUM(IFNULL(views, 0)) views FROM dates d, ranges r LEFT JOIN `project.dataset.table` USING(day, productId) WHERE day BETWEEN min_day AND max_day GROUP BY day -- ORDER BY day
result is
Row day Inventory views 1 2019-07-01 1 8 2 2019-07-02 2 10 3 2019-07-03 2 2 4 2019-07-04 3 18 5 2019-07-05 2 2 6 2019-07-06 1 0 7 2019-07-07 1 0 8 2019-07-08 1 0 9 2019-07-09 1 0 10 2019-07-10 1 17