Get cumulative distinct count of active ids (ids where deleted date is null as of/before the modified date)

Question

I am having trouble getting the cumulative distinct count of resource ids as of different modified dates in Vertica. The table below has a resource id, a modified date, and a deleted date, and I want to calculate the count of distinct active resources as of each unique modified date. A resource is considered active when its deleted date is null as of or before the modified date.

Accepted Answer

To me, a DATE has no hours, minutes, seconds, let alone second fractions, so I renamed the time-containing attributes to %_ts, as they are TIMESTAMPs.

I had to completely start from scratch to solve it. I think this is the first problem I had to solve with as many as 5 Common Table Expressions:

1. Add a Boolean is_active that is never NULL.
2. Add the previously obtained is_active using LAG(). NULL here means there is no predecessor for the same resource id.
3. Remove the rows whose previous is_active is equal to the current is_active.
4. UNION ALL the positive COUNT DISTINCTs of the active rows and the negative COUNT DISTINCTs of the inactive rows. Because only the de-duplicated rows are counted, this also removes the last timestamp.
5. Get the distinct timestamps from the original input for the final query.

The final query takes CTE 5 and LEFT JOINs it with CTE 4, making a running sum of the obtained distinct counts.

Here goes:

    WITH
    -- not part of the final query: this is your input data
    indata(sa_resource_id,modified_ts,deleted_ts) AS (
              SELECT 1,TIMESTAMP '2022-01-22 15:46:06.758',NULL
    UNION ALL SELECT 2,TIMESTAMP '2022-01-22 15:46:06.758',NULL
    UNION ALL SELECT 16,TIMESTAMP '2022-04-22 15:46:06.758',NULL
    UNION ALL SELECT 17,TIMESTAMP '2022-04-22 15:46:06.758',NULL
    UNION ALL SELECT 18,TIMESTAMP '2022-04-22 15:46:06.758',NULL
    UNION ALL SELECT 16,TIMESTAMP '2022-04-29 15:46:06.758',TIMESTAMP '2022-04-29 15:46:06.758'
    UNION ALL SELECT 17,TIMESTAMP '2022-04-29 15:46:06.758',TIMESTAMP '2022-04-29 15:46:06.758'
    UNION ALL SELECT 1,TIMESTAMP '2022-05-22 15:46:06.758',TIMESTAMP '2022-05-22 15:46:06.758'
    UNION ALL SELECT 2,TIMESTAMP '2022-05-22 15:46:06.758',TIMESTAMP '2022-05-22 15:46:06.758'
    UNION ALL SELECT 1,TIMESTAMP '2022-05-23 22:16:06.758',NULL
    UNION ALL SELECT 1,TIMESTAMP '2022-05-24 22:16:06.758',TIMESTAMP '2022-05-24 22:16:06.758'
    UNION ALL SELECT 1,TIMESTAMP '2022-05-25 22:16:06.758',NULL
    UNION ALL SELECT 1,TIMESTAMP '2022-05-27 22:16:06.758',NULL
    )
    -- real query starts here, replace the following comma with "WITH" ...
    ,
    -- need a "active flag" that is never null
    w_active_flag AS (
      SELECT
        *
      , (deleted_ts IS NULL) AS is_active
      FROM indata
    )
    ,
    -- need current and previous is_active to filter ..
    w_prev_flag AS (
      SELECT
        *
      , LAG(is_active) OVER w AS prev_flag
      FROM w_active_flag
      WINDOW w AS(PARTITION BY sa_resource_id ORDER BY modified_ts)
    )
    ,
    -- use obtained filter arguments to filter out two consecutive
    -- active or non-active rows for same sa_resource_id
    -- this can remove timestamps from the final result
    de_duped AS (
      SELECT
        sa_resource_id
      , modified_ts
      , is_active
      FROM w_prev_flag
      WHERE prev_flag IS NULL OR prev_flag <> is_active
    )
    -- get count distinct "sa_resource_id" only now
    ,
    grp AS (
      SELECT
        modified_ts
      , COUNT(DISTINCT sa_resource_id) AS dca_agent_count
      FROM de_duped
      WHERE is_active
      GROUP BY modified_ts
      UNION ALL
      SELECT
        modified_ts
      , COUNT(DISTINCT sa_resource_id) * -1 AS dca_agent_count
      FROM de_duped
      WHERE NOT is_active
      GROUP BY modified_ts
    )
    ,
    -- get back all input timestamps in a help table
    tslist AS (
      SELECT DISTINCT
        modified_ts
      FROM indata
    )
    SELECT
      tslist.modified_ts
    , SUM(NVL(dca_agent_count,0)) OVER w AS dca_agent_count
    FROM tslist LEFT JOIN grp USING(modified_ts)
    WINDOW w AS (ORDER BY tslist.modified_ts);
    -- out        modified_ts       | dca_agent_count
    -- out -------------------------+-----------------
    -- out  2022-01-22 15:46:06.758 |               2
    -- out  2022-04-22 15:46:06.758 |               5
    -- out  2022-04-29 15:46:06.758 |               3
    -- out  2022-05-22 15:46:06.758 |               1
    -- out  2022-05-23 22:16:06.758 |               2
    -- out  2022-05-24 22:16:06.758 |               1
    -- out  2022-05-25 22:16:06.758 |               2
    -- out  2022-05-27 22:16:06.758 |               2
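To follow the logic of the steps without a database at hand, the same idea can be sketched in plain Python. This is only an illustration, not part of the answer's query: the names `ROWS` and `running_active_counts` are made up here, timestamps are kept as ISO-formatted strings (which sort chronologically), and the sketch assumes at most one row per resource id per timestamp.

```python
# Sample rows from the question: (sa_resource_id, modified_ts, deleted_ts).
# ROWS and running_active_counts are hypothetical names used for illustration.
ROWS = [
    (1, "2022-01-22 15:46:06.758", None),
    (2, "2022-01-22 15:46:06.758", None),
    (16, "2022-04-22 15:46:06.758", None),
    (17, "2022-04-22 15:46:06.758", None),
    (18, "2022-04-22 15:46:06.758", None),
    (16, "2022-04-29 15:46:06.758", "2022-04-29 15:46:06.758"),
    (17, "2022-04-29 15:46:06.758", "2022-04-29 15:46:06.758"),
    (1, "2022-05-22 15:46:06.758", "2022-05-22 15:46:06.758"),
    (2, "2022-05-22 15:46:06.758", "2022-05-22 15:46:06.758"),
    (1, "2022-05-23 22:16:06.758", None),
    (1, "2022-05-24 22:16:06.758", "2022-05-24 22:16:06.758"),
    (1, "2022-05-25 22:16:06.758", None),
    (1, "2022-05-27 22:16:06.758", None),
]


def running_active_counts(rows):
    """Sum +1/-1 state changes per timestamp, then accumulate them."""
    deltas = {}     # modified_ts -> net change in active resources
    prev_flag = {}  # sa_resource_id -> is_active of its previous row (the LAG)
    for rid, modified_ts, deleted_ts in sorted(rows, key=lambda r: r[1]):
        is_active = deleted_ts is None
        # Register every timestamp, even if it contributes no change
        # (this plays the role of the tslist LEFT JOIN).
        deltas.setdefault(modified_ts, 0)
        # Count a row only if it has no predecessor or flips the state.
        if prev_flag.get(rid) != is_active:
            deltas[modified_ts] += 1 if is_active else -1
        prev_flag[rid] = is_active
    total, result = 0, []
    for ts in sorted(deltas):
        total += deltas[ts]
        result.append((ts, total))
    return result


for ts, count in running_active_counts(ROWS):
    print(ts, count)
```

On the sample rows the counts come out as 2, 5, 3, 1, 2, 1, 2, 2, matching the query output above.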

sa_resource_id  modified_date            deleted_date
1               2022-01-22 15:46:06.758  NULL
2               2022-01-22 15:46:06.758  NULL
16              2022-04-22 15:46:06.758  NULL
17              2022-04-22 15:46:06.758  NULL
18              2022-04-22 15:46:06.758  NULL
16              2022-04-29 15:46:06.758  2022-04-29 15:46:06.758
17              2022-04-29 15:46:06.758  2022-04-29 15:46:06.758
1               2022-05-22 15:46:06.758  2022-05-22 15:46:06.758
2               2022-05-22 15:46:06.758  2022-05-22 15:46:06.758
1               2022-05-23 22:16:06.758  NULL
1               2022-05-24 22:16:06.758  2022-05-24 22:16:06.758
1               2022-05-25 22:16:06.758  NULL
1               2022-05-27 22:16:06.758  NULL
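One way to sanity-check any solution against these sample rows is to apply the definition directly. The Python sketch below does so under one assumed reading of "active as of a timestamp": for each distinct modified timestamp, a resource counts as active if its latest row at or before that timestamp has no deleted date. The names `ROWS` and `active_counts_by_definition` are hypothetical, and the quadratic loop is meant for small checks only, not for production data.

```python
# Sample rows from the table: (sa_resource_id, modified_date, deleted_date).
# None stands for an empty deleted date. All names here are illustrative.
ROWS = [
    (1, "2022-01-22 15:46:06.758", None),
    (2, "2022-01-22 15:46:06.758", None),
    (16, "2022-04-22 15:46:06.758", None),
    (17, "2022-04-22 15:46:06.758", None),
    (18, "2022-04-22 15:46:06.758", None),
    (16, "2022-04-29 15:46:06.758", "2022-04-29 15:46:06.758"),
    (17, "2022-04-29 15:46:06.758", "2022-04-29 15:46:06.758"),
    (1, "2022-05-22 15:46:06.758", "2022-05-22 15:46:06.758"),
    (2, "2022-05-22 15:46:06.758", "2022-05-22 15:46:06.758"),
    (1, "2022-05-23 22:16:06.758", None),
    (1, "2022-05-24 22:16:06.758", "2022-05-24 22:16:06.758"),
    (1, "2022-05-25 22:16:06.758", None),
    (1, "2022-05-27 22:16:06.758", None),
]


def active_counts_by_definition(rows):
    """Brute force: re-evaluate every resource's latest state per timestamp."""
    result = []
    # ISO-formatted timestamp strings compare chronologically.
    for ts in sorted({modified for _, modified, _ in rows}):
        latest = {}  # sa_resource_id -> (modified_date, deleted_date)
        for rid, modified, deleted in rows:
            if modified > ts:
                continue  # row lies in the future relative to ts
            if rid not in latest or modified >= latest[rid][0]:
                latest[rid] = (modified, deleted)
        active = sum(1 for _, deleted in latest.values() if deleted is None)
        result.append((ts, active))
    return result


for ts, count in active_counts_by_definition(ROWS):
    print(ts, count)
```

For the rows above this prints the counts 2, 5, 3, 1, 2, 1, 2, 2, one per distinct modified date in chronological order.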

