To put the problem in words, I have a massive table which includes subscribers and data for every day. If the subscriber no longer exists, then they will have no more records i.e. SUB123 no longer exists from the 28/10/2021 then this subscriber will have records up every day until 27/10/2021. The problem at hand is that some subscribers have missing dates and this could perhaps be as it is a weekend or other problems. I want to fill these records with null values so that they could be on record.
The current problem:
Subscriber | Date | Rev |
---|---|---|
sub123 | 25/10/2021 | 256 |
sub456 | 25/10/2021 | 282 |
sub123 | 26/10/2021 | 652 |
sub123 | 27/10/2021 | 396 |
sub456 | 28/10/2021 | 132 |
sub456 | 29/10/2021 | 484 |
sub456 | 01/11/2021 | 96 |
sub456 | 02/11/2021 | 45 |
The desired solution:
Subscriber | Date | Rev |
---|---|---|
sub123 | 25/10/2021 | 256 |
sub456 | 25/10/2021 | 282 |
sub123 | 26/10/2021 | 652 |
sub456 | 26/10/2021 | NULL |
sub123 | 27/10/2021 | 396 |
sub456 | 27/10/2021 | NULL |
sub456 | 28/10/2021 | 132 |
sub456 | 29/10/2021 | 484 |
sub456 | 30/10/2021 | NULL |
sub456 | 31/10/2021 | NULL |
sub456 | 01/11/2021 | 96 |
sub456 | 02/11/2021 | 45 |
My current attempt:
WITH all_dates as ( SELECT CAST(date_column AS DATE) date_column, b.subscriber, b.date FROM (VALUES (SEQUENCE( min(b.date) OVER (PARTITION BY b.subscriber ORDER BY b.date), max(b.date) OVER (PARTITION BY b.subscriber ORDER BY b.date), INTERVAL '1' DAY) ) ) AS t1(date_array) CROSS JOIN UNNEST(date_array) AS t2(date_column) LEFT JOIN MAINTABLE b on t2.date_column = b.date ), customer_dates as ( SELECT distinct a.subscriber, a.date, b.date_column from MAINTABLE a left join all_dates b on a.date = b.date_column ) SELECT * from customer_dates a
This code doesn’t work but its an attempt to what I am trying to accomplish if I were to use the following code that is attached below it will generate dates for all subscribers from the initial date to the end date which is not what we want hence why the above code is what was attempted.
WITH all_dates as ( SELECT CAST(date_column AS DATE) date_column, b.subscriber, b.date FROM (VALUES (SEQUENCE( date('2021-10-25'), date('2022-04-30'), INTERVAL '1' DAY) ) ) AS t1(date_array) CROSS JOIN UNNEST(date_array) AS t2(date_column) LEFT JOIN MAINTABLE b on t2.date_column = b.date ), customer_dates as ( SELECT distinct a.subscriber, a.date, b.date_column from MAINTABLE a left join all_dates b on a.date = b.date_column ) SELECT * from customer_dates a
Advertisement
Answer
You can use lag
function to generate missing ranges to flatten with unnest
and handle Rev
additionally:
-- sample data WITH dataset (Subscriber, Date, Rev) AS ( VALUES ('sub123', date_parse('25-10-2021', '%d-%m-%Y'), 256), ('sub456', date_parse('25-10-2021', '%d-%m-%Y'), 282), ('sub123', date_parse('26-10-2021', '%d-%m-%Y'), 652), ('sub123', date_parse('27-10-2021', '%d-%m-%Y'), 396), ('sub456', date_parse('28-10-2021', '%d-%m-%Y'), 132), ('sub456', date_parse('29-10-2021', '%d-%m-%Y'), 484), ('sub456', date_parse('01-11-2021', '%d-%m-%Y'), 96), ('sub456', date_parse('02-11-2021', '%d-%m-%Y'), 45) ) -- query select subscriber, lifted_date as date, if(date = lifted_date, rev, NULL) rev from ( select Subscriber, Rev, cast(date as date) date, lag(cast(date as date)) over(partition by Subscriber order by date) prev_date from dataset ) cross join unnest( array_except(sequence(coalesce(prev_date, date), date, interval '1' day), array[prev_date]) ) as t(lifted_date) order by subscriber, date
Output:
subscriber | date | rev |
---|---|---|
sub123 | 2021-10-25 00:00:00.000 | 256 |
sub123 | 2021-10-26 00:00:00.000 | 652 |
sub123 | 2021-10-27 00:00:00.000 | 396 |
sub456 | 2021-10-25 00:00:00.000 | 282 |
sub456 | 2021-10-26 00:00:00.000 | |
sub456 | 2021-10-27 00:00:00.000 | |
sub456 | 2021-10-28 00:00:00.000 | 132 |
sub456 | 2021-10-29 00:00:00.000 | 484 |
sub456 | 2021-10-30 00:00:00.000 | |
sub456 | 2021-10-31 00:00:00.000 | |
sub456 | 2021-11-01 00:00:00.000 | 96 |
sub456 | 2021-11-02 00:00:00.000 | 45 |