Skip to content
Advertisement

SQL: Filling in Missing Records with Conditional

I need to count the number of products that existed in inventory by date. In the database however, a product is only recorded when it was viewed by a consumer.

For example consider this basic table structure:

date      |  productId |   views
July 1    |  A         |   8
July 2    |  A         |   6
July 2    |  B         |   4
July 3    |  A         |   2
July 4    |  A         |   8
July 4    |  B         |   6
July 4    |  C         |   4
July 5    |  C         |   2
July 10   |  A         |   17

Using the following query, I attempt to determine the amount of products in inventory on a given date.

select date, count(distinct productId) as Inventory, sum(views) as views
from (
   select date, productId, count(*) as views
   from SomeTable
   group by date, productID
   order by date asc, productID asc
)
group by date

This is the output

date      |  Inventory |   views
July 1    |  1         |   8
July 2    |  2         |   10
July 3    |  1         |   2
July 4    |  3         |   18
July 5    |  1         |   2
July 10   |  1         |   17

My output is not an accurate reflection of how many products were in inventory due to missing rows.

The correct understanding of inventory is as follows:
– Product A was present in inventory from July 1 – July 10.
– Product B was present in inventory from July 2 – July 4.
– Product C was in inventory from July 4 – July 5.

The correct SQL output should be:

date      |  Inventory |   views
July 1    |  1         |   8
July 2    |  2         |   10
July 3    |  2         |   2
July 4    |  3         |   18
July 5    |  2         |   2
July 6    |  1         |   0
July 7    |  1         |   0
July 8    |  1         |   0
July 9    |  1         |   0
July 10   |  1         |   17

If you are following along, let me confirm that I am comfortable defining “in inventory” as the date difference between the first & last view.

I have followed the following faulty process:

First I created a table which was the cartesian product of every productID & every date. ”’ with Dates as ( select date from SomeTable group by date ), Products as ( select productId from SomeTable group by productId ) select Dates.date, Products.productId from Dates cross join Products ”’

Then I attempted do a right outer join to reduce this to just the missing records:

with Records as (
select date, productId, count(*) as views
from SomeTable
group by date, productId
),
Cartesian as (
{See query above}
)
Select Cartesian.date, Cartesian.productId, 0 as views #for upcoming union
from Cartesian right outer join Records 
on Cartesian.date = Records.date 
where Records.productId is null

Then with the missing rows in hand, union them back onto the Records. in doing so, I create a new problem: extra rows.

date      |  productId |   views
July 1    |  A         |   8
July 1    |  B         |   0
July 1    |  C         |   0
July 2    |  A         |   6
July 2    |  B         |   4
July 2    |  C         |   0
July 3    |  A         |   2
July 3    |  B         |   0
July 3    |  C         |   0
July 4    |  A         |   8
July 4    |  B         |   6
July 4    |  C         |   4
July 5    |  A         |   2
July 5    |  B         |   0
July 5    |  C         |   0
July 6    |  A         |   0
July 6    |  B         |   0
July 6    |  C         |   0
July 7    |  A         |   0
July 7    |  B         |   0
July 7    |  C         |   0
July 8    |  A         |   0
July 8    |  B         |   0
July 8    |  C         |   0
July 9    |  A         |   0
July 9    |  B         |   0
July 9    |  C         |   0
July 10   |  A         |   17
July 10   |  B         |   0
July 10   |  C         |   0

And when I run my simple query select date, count(distinct productId) as Inventory, sum(views) as views on that table I get the wrong output again:

date      |  Inventory |   views
July 1    |  3         |   8
July 2    |  3         |   10
July 3    |  3         |   2
July 4    |  3         |   18
July 5    |  3         |   2
July 6    |  3         |   0
July 7    |  3         |   0
July 8    |  3         |   0
July 9    |  3         |   0
July 10   |  3         |   17

My next thought would be to iterate through each productId, determine it’s first & last date, then Union that with the Cartesian table with the condition that the Cartesian.date falls between the first & last date for each specific product.
There’s got to be an easier way to do this. Thanks.

Advertisement

Answer

Below is for BigQuery Standard SQL

#standardSQL
WITH dates AS (
  SELECT day FROM (
    SELECT MIN(day) min_day, MAX(day) max_day
    FROM `project.dataset.table`
  ), UNNEST(GENERATE_DATE_ARRAY(min_day, max_day, INTERVAL 1 DAY)) day
), ranges AS (
  SELECT productId, MIN(day) min_day, MAX(day) max_day
  FROM `project.dataset.table` t
  GROUP BY productId
)
SELECT day, COUNT(DISTINCT productId) Inventory, SUM(IFNULL(views, 0)) views
FROM dates d, ranges r 
LEFT JOIN `project.dataset.table` USING(day, productId)
WHERE day BETWEEN min_day AND max_day 
GROUP BY day

If to apply to sample data from your question as in below example

#standardSQL
WITH `project.dataset.table` AS (
  SELECT DATE '2019-07-01' day, 'A' productId, 8 views UNION ALL
  SELECT '2019-07-02', 'A', 6 UNION ALL
  SELECT '2019-07-02', 'B', 4 UNION ALL
  SELECT '2019-07-03', 'A', 2 UNION ALL
  SELECT '2019-07-04', 'A', 8 UNION ALL
  SELECT '2019-07-04', 'B', 6 UNION ALL
  SELECT '2019-07-04', 'C', 4 UNION ALL
  SELECT '2019-07-05', 'C', 2 UNION ALL
  SELECT '2019-07-10', 'A', 17 
), dates AS (
  SELECT day FROM (
    SELECT MIN(day) min_day, MAX(day) max_day
    FROM `project.dataset.table`
  ), UNNEST(GENERATE_DATE_ARRAY(min_day, max_day, INTERVAL 1 DAY)) day
), ranges AS (
  SELECT productId, MIN(day) min_day, MAX(day) max_day
  FROM `project.dataset.table` t
  GROUP BY productId
)
SELECT day, COUNT(DISTINCT productId) Inventory, SUM(IFNULL(views, 0)) views
FROM dates d, ranges r 
LEFT JOIN `project.dataset.table` USING(day, productId)
WHERE day BETWEEN min_day AND max_day 
GROUP BY day
-- ORDER BY day

result is

Row day         Inventory   views    
1   2019-07-01  1           8    
2   2019-07-02  2           10   
3   2019-07-03  2           2    
4   2019-07-04  3           18   
5   2019-07-05  2           2    
6   2019-07-06  1           0    
7   2019-07-07  1           0    
8   2019-07-08  1           0    
9   2019-07-09  1           0    
10  2019-07-10  1           17   
9 People found this is helpful
Advertisement