For rows with the same A, B, C values, how to select only “earliest” row (specified by column D) within a given time range?

I currently have a query (simplified below):

SELECT name, node, points, <bunch of other fields>,
  DATE(snapshot_date) AS snap_date,
  HOUR(snapshot_date) AS snap_hour,
  CASE
    WHEN MINUTE(snapshot_date) IN (5, 6) THEN 0
    WHEN MINUTE(snapshot_date) IN (35, 36) THEN 30
  END AS snap_mins,
from some_table
where 
  <bunch of conditions here>

Basically, the data source we’ve been getting data from writes their timestamps as when their job finished running rather than when the data actually came in (we can’t do anything about this on our end). Their jobs regularly end at 5-6 minutes after the data came in so we infer that the actual minutes value for the data by setting the snap_mins ourselves. (For reasons I won’t go into, we can only use the values on the half hour and need to have a row for the given name/id/node each half hour.)

However, I know this is really really flaky and I want to try to, instead of looking for specific minute values, just grab the first row whose snapshot_date’s minute value is somewhere between 4-10 minutes after each half hour (XX:00 or XX:30). For example, if the table looks like this:

| name  | node | points | ... | snapshot_date | 
| Jane  | 1    | 1      | ... | 1/1/21 22:02  |
| Jane  | 1    | 2      | ... | 1/1/21 22:05  | // take this value
| Jane  | 1    | 3      | ... | 1/1/21 22:09  |
| Jane  | 1    | 4      | ... | 1/1/21 22:38  | // take this value
| Jane  | 1    | 5      | ... | 1/1/21 22:41  |

| Jane  | 1    | 1      | ... | 1/1/21 23:05  | // take this value
| Jane  | 1    | 2      | ... | 1/1/21 23:06  |
| Jane  | 1    | 3      | ... | 1/1/21 23:35  | // take this value
| Jane  | 1    | 4      | ... | 1/1/21 23:38  |

| Jane  | 3    | 1      | ... | 1/1/21 23:02  |
| Jane  | 3    | 2      | ... | 1/1/21 23:07  | // take this value (current query wouldn't pick this up because it's only looking for snapshots where the minute value is 5 or 6) 
| Jane  | 3    | 3      | ... | 1/1/21 23:10  |
| Jane  | 3    | 4      | ... | 1/1/21 23:35  | // take this value
| Jane  | 3    | 5      | ... | 1/1/21 23:38  |

So the query results I want to get would be:

| name  | node | points | ... | snap_date | snap_hour | snap_mins
| Jane  | 1    | 2      | ... | 1/1/21    | 22        | 0
| Jane  | 1    | 4      | ... | 1/1/21    | 22        | 30
| Jane  | 1    | 1      | ... | 1/1/21    | 23        | 0
| Jane  | 1    | 3      | ... | 1/1/21    | 23        | 30
| Jane  | 3    | 2      | ... | 1/1/21    | 23        | 0
| Jane  | 3    | 4      | ... | 1/1/21    | 23        | 30

However, I’ve been struggling with determining if it’s even possible to do this. Any help would be greatly appreciated!

Answer

One option

Add the columns defining the associated 30 minute interval
Sequence number the rows within a 30 minute group
Take only sequence #1 from each group

WITH x AS (
  SELECT name, node, points, snapshot_date, <bunch of other fields>,
  DATE(snapshot_date) AS snap_date,
  HOUR(snapshot_date) AS snap_hour,
  CASE
    WHEN MINUTE(snapshot_date) < 31 THEN 0
    ELSE 30
  END AS snap_mins
FROM some_table
where 
  <bunch of conditions here> 
), 
y as (
  SELECT x.*,
  ROW_NUMBER() OVER (PARTITION BY name, node, snap_date, snap_hour, snap_minute ORDER BY snapshot_date) AS seq
  FROM x
)
SELECT * FROM y WHERE seq = 1

If you really need to omit rows from MINUTE 0-3 and 30-33 from consideration, then add a WHERE condition to the first CTE.

Advertisement

Answer