Skip to content
Advertisement

How to unnest the table based on date interval in Presto?

I have a table with events, where start_dt is a start and end_dt an end of the event. This table is partitioned by dt column derived from the end_dt. This means that events that start before and end after midnight are present only in one partition. What I need to do is to split each event into as many rows as the number of dates the event was associated with. Is there any smart way to achieve it using the Presto SQL syntax?

Input:

  id  | start_dt                 | end_dt                  | dt
------+--------------------------+-------------------------+----------
 1    | 2020-09-24 21:56:12.669  | 2020-09-25 00:26:16.440 | 2020-09-25
 2    | 2020-09-25 17:12:02.699  | 2020-09-25 17:42:02.699 | 2020-09-25
 3    | 2020-09-23 23:47:29.146  | 2020-09-25 00:17:29.146 | 2020-09-25

Expected output:

  id  | start_dt                 | end_dt                  | dt
------+--------------------------+-------------------------+----------
 1    | 2020-09-24 21:56:12.669  | 2020-09-24 23:59:59.999 | 2020-09-24
 1    | 2020-09-25 00:00:00.001  | 2020-09-25 00:26:16.440 | 2020-09-25
 2    | 2020-09-25 17:12:02.699  | 2020-09-25 17:42:02.699 | 2020-09-25
 3    | 2020-09-23 23:47:29.146  | 2020-09-23 23:59:59.999 | 2020-09-23
 3    | 2020-09-24 00:00:00.001  | 2020-09-24 23:59:59.999 | 2020-09-24
 3    | 2020-09-25 00:00:00.001  | 2020-09-25 00:17:29.146 | 2020-09-25

Advertisement

Answer

In Presto, you can use sequence() to generate an array of dates. The rest is just unnesting and conditional logic:

select t.id, 
    case when date(t.start_dt) = s.dt then t.start_dt else cast(s.dt as timestamp)                    end as new_start_dt,
    case when date(t.end_dt)   = s.dt then t.end_dt   else cast(s.dt as timestamp) + interval '1' day end as new_end_dt,
    s.dt
from mytable t
cross join unnest(sequence(date(t.start_dt), date(t.end_dt))) as s(dt)

Note that this generates dates intervals that start and end at midnight exactly: the half-open intervals logic makes more sense to me than the removing or adding a millisecond here and there. You can easily change that if you like.

User contributions licensed under: CC BY-SA
10 People found this is helpful
Advertisement