
Get cumulative distinct count of active ids (ids whose deleted date is null as of/before the modified date)

I am having trouble getting the cumulative distinct count of resource ids as of each modified date in Vertica. The table below has a resource id, a modified date, and a deleted date, and I want to count the distinct active resources as of every unique modified date. A resource is considered active when its deleted date is null as of/before that modified date.

I was able to get the count when, for a particular resource (say resource id 1), the active rows (deleted date null) and the inactive rows (deleted date not null) do not occur consecutively.

But when they occur consecutively, I have to count the resource as 1 until it becomes inactive, and then as 0 from that row through all consecutive inactive rows until it becomes active again. The same applies to every distinct resource id, and the result is the cumulative sum across them.

sa_resource_id  modified_date            deleted_Date
1               2022-01-22 15:46:06.758  NULL
2               2022-01-22 15:46:06.758  NULL
16              2022-04-22 15:46:06.758  NULL
17              2022-04-22 15:46:06.758  NULL
18              2022-04-22 15:46:06.758  NULL
16              2022-04-29 15:46:06.758  2022-04-29 15:46:06.758
17              2022-04-29 15:46:06.758  2022-04-29 15:46:06.758
1               2022-05-22 15:46:06.758  2022-05-22 15:46:06.758
2               2022-05-22 15:46:06.758  2022-05-22 15:46:06.758
1               2022-05-23 22:16:06.758  NULL
1               2022-05-24 22:16:06.758  2022-05-24 22:16:06.758
1               2022-05-25 22:16:06.758  NULL
1               2022-05-27 22:16:06.758  NULL

This is the partition and sum query I have tried out where I partition the table based on resource ids and do sum over different modified dates.

Current Output:

md                       dca_agent_count
2022-01-22 15:46:06.758  2
2022-04-22 15:46:06.758  5
2022-04-29 15:46:06.758  3
2022-05-22 15:46:06.758  1
2022-05-23 22:16:06.758  2
2022-05-24 22:16:06.758  1
2022-05-25 22:16:06.758  2
2022-05-27 22:16:06.758  3

In the output above, every value is correct except the last row (2022-05-27), where the count should be 2 instead of 3: resource id 1 was already active as of 2022-05-25, so it must not be counted again.

How do I get the cumulative distinct count of sa_resource_id values as of each modified date, based on the deleted date condition (null/not null), so that the count does not change when the same state (null or not null) occurs consecutively?
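To make the expected result concrete, here is a minimal reference sketch in plain Python rather than Vertica SQL. It assumes at most one row per resource id and timestamp, and it simply keeps a set of currently active ids, so a repeated active or inactive row for the same id cannot change the count. The names `ROWS` and `active_counts` are made up for this illustration.

```python
from collections import defaultdict

# Sample rows from the question: (sa_resource_id, modified_date, deleted_date or None).
ROWS = [
    (1, "2022-01-22 15:46:06.758", None),
    (2, "2022-01-22 15:46:06.758", None),
    (16, "2022-04-22 15:46:06.758", None),
    (17, "2022-04-22 15:46:06.758", None),
    (18, "2022-04-22 15:46:06.758", None),
    (16, "2022-04-29 15:46:06.758", "2022-04-29 15:46:06.758"),
    (17, "2022-04-29 15:46:06.758", "2022-04-29 15:46:06.758"),
    (1, "2022-05-22 15:46:06.758", "2022-05-22 15:46:06.758"),
    (2, "2022-05-22 15:46:06.758", "2022-05-22 15:46:06.758"),
    (1, "2022-05-23 22:16:06.758", None),
    (1, "2022-05-24 22:16:06.758", "2022-05-24 22:16:06.758"),
    (1, "2022-05-25 22:16:06.758", None),
    (1, "2022-05-27 22:16:06.758", None),
]


def active_counts(rows):
    """Return [(modified_date, number of active ids as of that date)]."""
    by_ts = defaultdict(list)
    for resource_id, modified, deleted in rows:
        by_ts[modified].append((resource_id, deleted))

    active = set()  # ids whose latest row so far has no deleted date
    result = []
    for ts in sorted(by_ts):  # ISO-style timestamp strings sort chronologically
        for resource_id, deleted in by_ts[ts]:
            if deleted is None:
                active.add(resource_id)      # adding an already active id is a no-op
            else:
                active.discard(resource_id)  # discarding an inactive id is a no-op
        result.append((ts, len(active)))
    return result


if __name__ == "__main__":
    for ts, count in active_counts(ROWS):
        print(ts, count)
```

For the sample data this yields the counts 2, 5, 3, 1, 2, 1, 2, 2, that is, 2 rather than 3 for the last timestamp.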


Answer

To me, a DATE has no hours, minutes, or seconds, let alone fractions of a second, so I renamed the time-containing attributes to %_ts, as they are TIMESTAMPs.

I had to completely start from scratch to solve it.

I think this is the first problem I have had to solve with as many as 5 Common Table Expressions:

  1. Add a Boolean is_active that is never NULL.
  2. Add the previous is_active of the same resource id using LAG(). NULL here means there is no predecessor for that resource id.
  3. Remove the rows whose previous is_active is equal to the current is_active.
  4. UNION SELECT the positive COUNT DISTINCTs of the active rows and the negative COUNT DISTINCTs of the inactive rows. This result no longer contains the last timestamp, because its only row was removed in step 3.
  5. Get the distinct timestamps from the original input for the final query.

The final query takes CTE 5 and LEFT JOINs it with CTE 4, making a running sum of the obtained distinct counts.

Here goes:
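What follows is a minimal sketch of the five steps described above, not necessarily the query as originally posted. To keep it runnable without a Vertica instance, the SQL is executed through Python's built-in `sqlite3` module (window functions need SQLite 3.25 or later); the SQL itself uses only LAG(), COUNT(DISTINCT), UNION ALL and a windowed SUM(), which should carry over to Vertica with little or no change, though that is untested here. The table name `resources` and the CTE names are assumptions made for this sketch; the `_ts` column names follow the renaming mentioned above. Two small deviations from the step list are deliberate: a first row that is already deleted is not subtracted, and the signed counts are summed per timestamp before the join so that each timestamp produces exactly one output row.

```python
import sqlite3

# Sample rows from the question: (sa_resource_id, modified_ts, deleted_ts or None).
ROWS = [
    (1, "2022-01-22 15:46:06.758", None),
    (2, "2022-01-22 15:46:06.758", None),
    (16, "2022-04-22 15:46:06.758", None),
    (17, "2022-04-22 15:46:06.758", None),
    (18, "2022-04-22 15:46:06.758", None),
    (16, "2022-04-29 15:46:06.758", "2022-04-29 15:46:06.758"),
    (17, "2022-04-29 15:46:06.758", "2022-04-29 15:46:06.758"),
    (1, "2022-05-22 15:46:06.758", "2022-05-22 15:46:06.758"),
    (2, "2022-05-22 15:46:06.758", "2022-05-22 15:46:06.758"),
    (1, "2022-05-23 22:16:06.758", None),
    (1, "2022-05-24 22:16:06.758", "2022-05-24 22:16:06.758"),
    (1, "2022-05-25 22:16:06.758", None),
    (1, "2022-05-27 22:16:06.758", None),
]

QUERY = """
WITH
w_active AS (   -- 1. flag that is never NULL
  SELECT sa_resource_id, modified_ts, deleted_ts IS NULL AS is_active
  FROM resources
),
w_prev AS (     -- 2. previous flag per resource id (NULL = no predecessor)
  SELECT sa_resource_id, modified_ts, is_active,
         LAG(is_active) OVER (PARTITION BY sa_resource_id
                              ORDER BY modified_ts) AS prev_active
  FROM w_active
),
w_changes AS (  -- 3. drop rows that repeat the previous state
  SELECT * FROM w_prev
  WHERE prev_active IS NULL OR prev_active <> is_active
),
w_deltas AS (   -- 4. positive counts for activations, negative for deactivations
  SELECT modified_ts, COUNT(DISTINCT sa_resource_id) AS delta
  FROM w_changes WHERE is_active GROUP BY modified_ts
  UNION ALL
  SELECT modified_ts, -COUNT(DISTINCT sa_resource_id)
  FROM w_changes
  WHERE NOT is_active AND prev_active IS NOT NULL  -- never-active ids are not subtracted
  GROUP BY modified_ts
),
w_ts AS (       -- 5. every distinct timestamp of the input
  SELECT DISTINCT modified_ts FROM resources
)
SELECT t.modified_ts,
       SUM(COALESCE(d.delta, 0)) OVER (ORDER BY t.modified_ts) AS active_count
FROM w_ts t
LEFT JOIN (SELECT modified_ts, SUM(delta) AS delta
           FROM w_deltas GROUP BY modified_ts) d
  ON d.modified_ts = t.modified_ts
ORDER BY t.modified_ts
"""


def running_active_counts(rows):
    """Load rows into an in-memory SQLite table and run the sketch query."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE resources "
        "(sa_resource_id INTEGER, modified_ts TEXT, deleted_ts TEXT)"
    )
    con.executemany("INSERT INTO resources VALUES (?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for ts, count in running_active_counts(ROWS):
        print(ts, count)
```

On the sample data the running sum is 2, 5, 3, 1, 2, 1, 2, 2. The last timestamp has no matching delta row, so the LEFT JOIN contributes 0 and the count stays at 2.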

User contributions licensed under: CC BY-SA