Skip to content
Advertisement

SQL: Pull rows based on sequence of values

I need to pull rows of data based on the existence of certain values that exist in a specific sequence.

Here’s an example of the data:

Header EventId EventDate
67891882 382 2022-01-21 09:29:50.000
67891882 81 2022-01-21 09:03:23.000
67891882 273 2022-01-21 09:03:51.000
67891882 77 2022-01-21 09:05:58.000
67891882 2 2022-01-21 09:29:48.000

The results I need are to capture the Header and the EventDate for EventId=81. Further criteria include:

  • EventID 81 is the “start” and EventID 77 is the “end”
  • Any number of other events can exist between these two with the exception of (60, 72, 73, 74, 75, 76, 83, 85, 86, 87, 103, 154, 166, 197, 199)

So in the example above, Eventid 81 with EventDate 2022-01-21 09:03:23.000 would qualify as a row I want to pull as 273 is not in the exception list.

ATTEMPT: I have tried the following query

SELECT *
FROM #Table
WHERE EventDate BETWEEN (SELECT EventDate
                         FROM #Table
                         WHERE EventId = 81)
                    AND (SELECT eventdate
                         FROM #Table
                         WHERE EventId = 77)
    AND EventId NOT IN (60, 72, 73, 74, 75, 76, 83, 85, 86, 87, 103, 154, 166, 197, 199)
ORDER BY 3

But I was immediately confronted with the fact that my sub-queries return more than one result, so this won’t work (I was using this to test a singular Header # example, which worked fine). So now I’m not quite sure how to proceed. I’d hate to think that I’d be forced to use a CURSOR, mostly because my source data is comprised of 266 million rows.

I had also previously tried using the LAG() function to find my “starting point”, but that possibility seemed to dissipate once the request started becoming more and more complex (with the addition of the exclusion list as well as the fact that there could be 1 or 40 rows in between the 81 and 77).


How should I proceed with this? Here’s some example data to play with. The Header can be thought of as a parent key, associated with any number of EventID (representing a specific action) and the EventDate with when this occurred:

create table #data (header int, eventid int, eventdate datetime)

insert into #data 
values
('62252595',    '22',   '5/23/2021  12:34:02 PM'),
('62252595',    '81',   '5/23/2021  12:34:03 PM'),
('62252595',    '29',   '5/23/2021  12:34:12 PM'),
('62252595',    '40',   '5/23/2021  12:34:27 PM'),
('62252595',    '22',   '5/23/2021  12:35:02 PM'),
('62252595',    '22',   '5/23/2021  12:36:12 PM'),
('62252595',    '37',   '5/23/2021  12:36:36 PM'),
('62252595',    '77',   '5/23/2021  12:37:04 PM'),
('62252595',    '6',    '5/23/2021  12:37:52 PM'),
('63252595',    '39',   '5/23/2021  12:38:01 PM'),
('63252595',    '81',   '5/23/2021  12:38:04 PM'),
('63252595',    '37',   '5/23/2021  12:38:06 PM'),
('63252595',    '21',   '5/23/2021  12:38:09 PM'),
('63252595',    '75',   '5/23/2021  12:38:10 PM'),
('63252595',    '77',   '5/23/2021  12:38:12 PM'),
('64252595',    '29',   '5/23/2021  12:38:15 PM'),
('64252595',    '26',   '5/23/2021  12:38:18 PM'),
('64252595',    '81',   '5/23/2021  12:38:20 PM'),
('64252595',    '40',   '5/23/2021  12:38:21 PM'),
('64252595',    '81',   '5/23/2021  12:38:24 PM'),
('64252595',    '83',   '5/23/2021  12:39:06 PM'),
('64252595',    '77',   '5/23/2021  12:39:07 PM'),
('65252595',    '41',   '5/23/2021  12:39:12 PM'),
('65252595',    '81',   '5/23/2021  12:39:16 PM'),
('65252595',    '37',   '5/23/2021  12:39:20 PM'),
('65252595',    '18',   '5/23/2021  12:39:56 PM'),
('65252595',    '18',   '5/23/2021  12:40:03 PM'),
('65252595',    '77',   '5/23/2021  12:40:15 PM'),
('65252595',    '36',   '5/23/2021  12:40:46 PM'),
('65252595',    '77',   '5/23/2021  12:40:53 PM')

EXPECTED RESULTS: From this #Data table, the results I would expect to see would be:

Header EventId EventDate
62252595 81 5/23/2021 12:34:03 PM
65252595 81 5/23/2021 12:39:16 PM

Header #’s 63252595 and 64252595 would not qualify because between the first instance of 81 and the first instance of 77 (partition by Header order by EventDate), there exists a 75 at 5/23/2021 12:38:10 PM and an 83 at 5/23/2021 12:39:06 PM respectively (both of which in exclusion list). I hope this clears up some confusion.


EDIT: After some thinking, I wonder if it would be possible to simplify this using a CASE expression. Using the example data from the #Data table above, I wrote this query:

select *
from (
    select * from (
        select *, id=case when EventId = 81 then 1 
                            when EventId = 77 then 2
                            when EventId in (60, 72, 73, 74, 75, 76, 83, 85, 86, 87, 103, 154, 166, 197, 199) then 5 else 0 end
        from #data) a
    where id <> 0)b
    order by 3

What this does is filters out all of the ‘allowable’ events and makes it so that I can filter to only see the unencumbered events where id=1 and then follows with a 2. What I’m not sure of as of yet is how to get it to show me only entries of id=1 with a following 2.

Advertisement

Answer

I’m going to assume that a start event (81) always starts a new “frame” from that row onwards, and an end event (77) always starts a new frame from the following row onwards.

I’m also going to assume that you’re only interested in frames where both a start and end event are present, and that the frame contains no excepted events (I’ll just use 00 as random allowable events and 199 as the only excepted event).

For example…

[81,00,81,00,77,00,81,199,77]
=> frame 0 = [81,00]
=> frame 1 = [81,00,77]
=> frame 2 = [00]
=> frame 3 = [81,199,77]

In that example only the 2nd frame’s start event would be returned (the others missing start and/or end events, or containing the excepted event).

WITH
  frame_start AS
(
  SELECT
    *,
    CASE
      WHEN
        81 = eventid
      OR
        77 = LAG(eventid) OVER (PARTITION BY header ORDER BY eventdate)
      THEN
        1
      ELSE
        0
    END
      AS new_frame
  FROM
    #data
),
  framed AS
(
  SELECT
    *,
    SUM(new_frame) OVER (PARTITION BY header ORDER BY eventdate) AS frame_id
  FROM
    frame_start
)
SELECT
  header, MIN(eventdate)
FROM
  framed
GROUP BY
  header, frame_id
HAVING
  SUM(CASE WHEN eventid IN (81,77) THEN 1 ELSE 0 END) = 2
  AND
  MAX(CASE WHEN eventid IN (199, etc) THEN 1 ELSE 0 END) = 0

Demo : https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=d54493c87629e3e59759ac9d119ec6ad


Explanation:

The first CTE adds a column called new_frame.

  • 1 = current row is 81,or previous row is 77
  • 0 = everything else

This marks the start of each new frame (as described at the top here).

The next CTE assigns an id to every row in each frame, by cumulatively summing the new_frame, in datetime order. The id starts at 0, then is incremented on each row by that row’s new_frame value (if new_frame=0, keep the same id as the previous row, if new_frame=1 increment the id by 1).

At this point the header’s rows are broken down in to frames (as described at the top here).

The final query groups by the frame and then filters the results with a HAVING clause. The first check is that the number of rows in the frame with 81 or 77 must total 2. The second check is that no rows in the frame can have an excepted event. If all checks pass, return the minimum timestamp in the frame, which by definition comes from the first row in the frame.

User contributions licensed under: CC BY-SA
5 People found this is helpful
Advertisement