Skip to content
Advertisement

Group By based on consequtive flag in Redshift (Gaps and Islands problem)

I am trying to solve “gaps and islands” and group consecutive checks together. My data looks like this

site_id     date_id    location_id    reservation_id    revenue
   5        20210101      125            792727           100
   5        20210101      126            792728           90
   5        20210101      228            792757           200
   5        20210102      217            792977           50
   5        20210102      218            792978           120
   5        20210102      219            792979           100

I want to group by consecutive location_id and consecutive reservation_id (both should be consecutive respectively) within same date and site_id, and sum revenue. so for the example above the output should be:

site_id     date_id    location_id    reservation_id    revenue
   5        20210101      125            792727           190
   5        20210101      228            792757           200
   5        20210102      217            792977           270

Location_id and reservation_id are of no importance except for this particular task, so a simple MAX() or MIN() for these two columns will work.

Advertisement

Answer

Try sessionization:

Two nested queries. First, a counter that is at 0 when a condition is false, at 1 when it’s true; in our case, the previous reservation id not being exactly one less than the current one.

The second query queries the first query, and makes a running sum of the counter obtained before. This gives a session id.

Then, group by site id, date id and the obtained session id.

WITH
indata(site_id,date_id,location_id,reservation_id,revenue) AS (
          SELECT 5,DATE '2021-01-01',125,792727,100
UNION ALL SELECT 5,DATE '2021-01-01',126,792728,90
UNION ALL SELECT 5,DATE '2021-01-01',228,792757,200
UNION ALL SELECT 5,DATE '2021-01-02',217,792977,50
UNION ALL SELECT 5,DATE '2021-01-02',218,792978,120
UNION ALL SELECT 5,DATE '2021-01-02',219,792979,100
)
,
with_counter AS (
  SELECT
    site_id
  , date_id
  , location_id
  , reservation_id
  , revenue
  , CASE
      WHEN reservation_id - LAG(reservation_id) OVER(
         PARTITION BY site_id ORDER BY date_id,reservation_id
      ) > 1
      THEN 1
      ELSE 0
    END AS counter
  FROM indata
)
,
with_session AS (
  SELECT
    site_id
  , date_id
  , location_id
  , reservation_id
  , revenue
  , SUM(counter) OVER(
      PARTITION BY site_id ORDER BY date_id,reservation_id
    ) AS session_id
  FROM with_counter
  -- test output ...
  -- out  site_id |  date_id   | location_id | reservation_id | revenue | session_id 
  -- out ---------+------------+-------------+----------------+---------+------------
  -- out        5 | 2021-01-01 |         125 |         792727 |     100 |          0
  -- out        5 | 2021-01-01 |         126 |         792728 |      90 |          0
  -- out        5 | 2021-01-01 |         228 |         792757 |     200 |          1
  -- out        5 | 2021-01-02 |         217 |         792977 |      50 |          2
  -- out        5 | 2021-01-02 |         218 |         792978 |     120 |          2
  -- out        5 | 2021-01-02 |         219 |         792979 |     100 |          2
)
SELECT
  site_id
, date_id
, MIN(location_id   ) AS location_id
, MIN(reservation_id) AS reservation_id
, SUM(revenue       ) AS revenue
FROM with_session
GROUP BY
  site_id
, date_id
, session_id
ORDER BY
  site_id
, date_id
;
-- out  site_id |  date_id   | location_id | reservation_id | revenue 
-- out ---------+------------+-------------+----------------+---------
-- out        5 | 2021-01-01 |         125 |         792727 |     190
-- out        5 | 2021-01-01 |         228 |         792757 |     200
-- out        5 | 2021-01-02 |         217 |         792977 |     270                                                                                                                  
User contributions licensed under: CC BY-SA
8 People found this is helpful
Advertisement