I am facing a problem while getting the cumulative distinct count of resource ids as of different modified dates in vertica. If you see the below table I have resource id, modified date and deleted date and I want to calculate the count of distinct active resources as of all unique modified dates. A resource is considered active when deleted date is null as of/before that modified date.
I was able to get the count when for a particular resource lets say resource id 1 the active count(deleted date null) or inactive count(deleted date not null) dont occur consecutively.
But when they occur consecutively I have to take the count as 1 till it becomes inactive and then I have to consider count as 0 for that resource id when it becomes inactive and all consecutive inactive values till it becomes active again. Likewise for all the distinct resource ids and cumulative sum of those.
| sa_resource_id | modified_date | deleted_Date |
|---|---|---|
| 1 | 2022-01-22 15:46:06.758 | |
| 2 | 2022-01-22 15:46:06.758 | |
| 16 | 2022-04-22 15:46:06.758 | |
| 17 | 2022-04-22 15:46:06.758 | |
| 18 | 2022-04-22 15:46:06.758 | |
| 16 | 2022-04-29 15:46:06.758 | 2022-04-29 15:46:06.758 |
| 17 | 2022-04-29 15:46:06.758 | 2022-04-29 15:46:06.758 |
| 1 | 2022-05-22 15:46:06.758 | 2022-05-22 15:46:06.758 |
| 2 | 2022-05-22 15:46:06.758 | 2022-05-22 15:46:06.758 |
| 1 | 2022-05-23 22:16:06.758 | |
| 1 | 2022-05-24 22:16:06.758 | 2022-05-24 22:16:06.758 |
| 1 | 2022-05-25 22:16:06.758 | |
| 1 | 2022-05-27 22:16:06.758 |
This is the partition and sum query I have tried out where I partition the table based on resource ids and do sum over different modified dates.
SELECT md,
dca_agent_count
FROM
(
SELECT modified_date AS md,
SUM(SUM(CASE WHEN deleted_Date IS NULL THEN 1
WHEN deleted_Date IS NOT NULL THEN -1 ELSE 0
END)) OVER (ORDER BY modified_date) AS dca_agent_count
FROM
(
SELECT sa_resource_id,
modified_date,
deleted_Date,
ROW_NUMBER() OVER (
PARTITION BY sa_Resource_id, deleted_Date
ORDER BY modified_date desc
) row_num
FROM mf_Shared_provider_Default.dca_entity_resource_raw
WHERE sa_ResourcE_id IS NOT NULL
AND sa_resource_id IN ('1','2','34','16','17','18')
) t
GROUP BY modified_date
ORDER BY modified_Date
) b
Current Output:
| md | dca_agent_count |
|---|---|
| 2022-01-22 15:46:06.758 | 2 |
| 2022-04-22 15:46:06.758 | 5 |
| 2022-04-29 15:46:06.758 | 3 |
| 2022-05-22 15:46:06.758 | 1 |
| 2022-05-23 22:16:06.758 | 2 |
| 2022-05-24 22:16:06.758 | 1 |
| 2022-05-25 22:16:06.758 | 2 |
| 2022-05-27 22:16:06.758 | 3 |
If you see the output above all the values are correct except for the last row 27-05-2022 where i need to get count 2 only instead of 3
How do I get the cumulative distinct count of sa resource ids as of the modified dates based on deleted date condition(null/not null) and count should not change when deleted date (null/not null) occur consecutively
Advertisement
Answer
To me, a DATE has no hours, minutes, seconds, let alone second fractions, so I renamed the time containing attributes to %_ts, as they are TIMESTAMPs.
I had to completely start from scratch to solve it.
I think this is the first problem I had to solve with as much as 5 Common Table Expressions:
- Add a Boolean
is_activethat is never NULL - Add the previous obtained
is_activeusingLAG(). NULL here means there is no predecessor for the same resource id. - remove the rows whose previous
is_activeis equal to the currentis_active. UNION SELECTthe positiveCOUNT DISTINCTs of the active rows and the negativeCOUNT DISTINCTs of the inactive rows. This also removes the last timestamp.- get the distinct timestamps from the original input for the final query
The final query takes CTE 5 and LEFT JOINs it with CTE 4, making a running sum of the obtained distinct counts.
Here goes:
WITH
-- not part of the final query: this is your input data
indata(sa_resource_id,modified_ts,deleted_ts) AS (
SELECT 1,TIMESTAMP '2022-01-22 15:46:06.758',NULL
UNION ALL SELECT 2,TIMESTAMP '2022-01-22 15:46:06.758',NULL
UNION ALL SELECT 16,TIMESTAMP '2022-04-22 15:46:06.758',NULL
UNION ALL SELECT 17,TIMESTAMP '2022-04-22 15:46:06.758',NULL
UNION ALL SELECT 18,TIMESTAMP '2022-04-22 15:46:06.758',NULL
UNION ALL SELECT 16,TIMESTAMP '2022-04-29 15:46:06.758',TIMESTAMP '2022-04-29 15:46:06.758'
UNION ALL SELECT 17,TIMESTAMP '2022-04-29 15:46:06.758',TIMESTAMP '2022-04-29 15:46:06.758'
UNION ALL SELECT 1,TIMESTAMP '2022-05-22 15:46:06.758',TIMESTAMP '2022-05-22 15:46:06.758'
UNION ALL SELECT 2,TIMESTAMP '2022-05-22 15:46:06.758',TIMESTAMP '2022-05-22 15:46:06.758'
UNION ALL SELECT 1,TIMESTAMP '2022-05-23 22:16:06.758',NULL
UNION ALL SELECT 1,TIMESTAMP '2022-05-24 22:16:06.758',TIMESTAMP '2022-05-24 22:16:06.758'
UNION ALL SELECT 1,TIMESTAMP '2022-05-25 22:16:06.758',NULL
UNION ALL SELECT 1,TIMESTAMP '2022-05-27 22:16:06.758',NULL
)
-- real query starts here, replace the following comma with "WITH" ...
,
-- need a "active flag" that is never null
w_active_flag AS (
SELECT
*
, (deleted_ts IS NULL) AS is_active
FROM indata
)
,
-- need current and previous is_active to filter ..
w_prev_flag AS (
SELECT
*
, LAG(is_active) OVER w AS prev_flag
FROM w_active_flag
WINDOW w AS(PARTITION BY sa_resource_id ORDER BY modified_ts)
)
,
-- use obtained filter arguments to filter out two consecutive
-- active or non-active rows for same sa_resource_id
-- this can remove timestamps from the final result
de_duped AS (
SELECT
sa_resource_id
, modified_ts
, is_active
FROM w_prev_flag
WHERE prev_flag IS NULL OR prev_flag <> is_active
)
-- get count distinct "sa_resource_id" only now
,
grp AS (
SELECT
modified_ts
, COUNT(DISTINCT sa_resource_id) AS dca_agent_count
FROM de_duped
WHERE is_active
GROUP BY modified_ts
UNION ALL
SELECT
modified_ts
, COUNT(DISTINCT sa_resource_id) * -1 AS dca_agent_count
FROM de_duped
WHERE NOT is_active
GROUP BY modified_ts
)
,
-- get back all input timestamps in a help table
tslist AS (
SELECT DISTINCT
modified_ts
FROM indata
)
SELECT
tslist.modified_ts
, SUM(NVL(dca_agent_count,0)) OVER w AS dca_agent_count
FROM tslist LEFT JOIN grp USING(modified_ts)
WINDOW w AS (ORDER BY tslist.modified_ts);
-- out modified_ts | dca_agent_count
-- out -------------------------+-----------------
-- out 2022-01-22 15:46:06.758 | 2
-- out 2022-04-22 15:46:06.758 | 5
-- out 2022-04-29 15:46:06.758 | 3
-- out 2022-05-22 15:46:06.758 | 1
-- out 2022-05-23 22:16:06.758 | 2
-- out 2022-05-24 22:16:06.758 | 1
-- out 2022-05-25 22:16:06.758 | 2
-- out 2022-05-27 22:16:06.758 | 2