How to find consecutive values in column

I have been given this task to try to detect some duplicate records in a table with a large volume of rows. The table comprises 2 joined tables. So to begin with I have:

select 
b.event_number_id, b.tenure_number_id, a.work_start_date, a.work_stop_date, a.amount
from
MTA.mta_sow_event a, mta_tenure_event_xref b
where
a.event_number_id = b.event_number_id

Now we have a table to work from. The duplicate records have unique event_number_id values, but the remaining fields contain identical data, so something like this:

| event_number_id |  tenure_number_id | work_start_date | work_stop_date |amount|
|-----------------|-------------------|-----------------|----------------|------|
|  5532733        | 688203            |     01-SEP-14   |  25-SEP-14     | 5000 |
|  5532734        | 688203            |     01-SEP-14   |  25-SEP-14     | 5000 |

So, this is an example of a duplicate record. There are consecutive event_number_id’s and all the remaining columns have identical information. We believe that our system has been creating duplicate events for some time now (this isn’t supposed to happen), so I want to query the whole joined table and find anything that has rows that have exactly the same data, but different and consecutive event numbers.

So far, I managed to make a simple query that shows me any rows that have identical information, excluding the event_number_id column:

select 
b.tenure_number_id, a.work_start_date, a.work_stop_date, a.amount, count(*)
from
MTA.mta_sow_event a, mta_tenure_event_xref b
where
a.event_number_id = b.event_number_id
group by
b.tenure_number_id, a.work_start_date, a.work_stop_date, a.amount
having
count(*) > 1

which returns:

|  tenure_number_id | work_start_date | work_stop_date |amount|Count(*)|
|-------------------|-----------------|----------------|------|--------|
| 688203            |     01-SEP-14   |  25-SEP-14     | 5000 |   2    |

The problem is, sometimes there are rows that have identical data, but could be valid, so the best we can do at this point is find any of these matching rows that have consecutive event_number_id’s. This is where I am hung up. Is there a way to pull out only the rows that contain these consecutive numbers?

Answer

Here’s an approach based on a join of the data sets:

with cte_base_data as (
  select 
     ... your query here ...)
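-- t1 is the earlier event; t2 is the suspected duplicate with the next id
select t1.*, t2.event_number_id as duplicate_event_number_id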
from cte_base_data t1 join
     cte_base_data t2 on (t1.tenure_number_id = t2.tenure_number_id and
                          t1.work_start_date  = t2.work_start_date  and
                          t1.work_stop_date   = t2.work_stop_date   and
                          t1.amount           = t2.amount)
where t1.event_number_id = t2.event_number_id - 1;

The efficiency will depend on a few factors, such as the efficiency of scanning the base tables and the size of the data sets.

It would be interesting to compare the execution plan of this method with those of the analytic-function methods. This join over a common table expression ought to be efficient if the optimizer chooses a hash join, which costs very little as long as it stays in memory (a big question mark over that for a large table).
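One way to make that comparison is to ask the database for the plan directly. The following is a minimal sketch, assuming Oracle (which the date format in the question suggests) and assuming the base query is the join from the question; run it once per candidate query and compare the operations and estimated costs.

```sql
-- Sketch only: assumes Oracle and the table names from the question.
explain plan for
with cte_base_data as (
  select b.event_number_id, b.tenure_number_id,
         a.work_start_date, a.work_stop_date, a.amount
  from   MTA.mta_sow_event a
  join   mta_tenure_event_xref b
    on   a.event_number_id = b.event_number_id
)
select t1.*
from   cte_base_data t1
join   cte_base_data t2
  on   t1.tenure_number_id = t2.tenure_number_id
 and   t1.work_start_date  = t2.work_start_date
 and   t1.work_stop_date   = t2.work_stop_date
 and   t1.amount           = t2.amount
where  t1.event_number_id = t2.event_number_id - 1;

-- Show the plan that was just generated.
select * from table(dbms_xplan.display);
```

The plan shown is the optimizer's estimate, not a measurement, so timing both queries against realistic data volumes is still worthwhile.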

I’d be inclined to go for the analytic functions if the event_number_id values were not strictly consecutive (if there might be gaps, for instance), since that would be harder to express as a join. Given that the duplicate’s id is simply the original’s id plus one, I think it’s worth taking a punt on a join.
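For reference, an analytic-function version might look roughly like the sketch below. It assumes Oracle and the base query from the question; the names cte_with_prev and prev_event_number_id are made up for illustration. LAG fetches the previous event_number_id within each group of otherwise identical rows, and the outer query keeps only rows whose predecessor is exactly one less.

```sql
-- Sketch only: assumes Oracle and the table names from the question.
with cte_base_data as (
  select b.event_number_id, b.tenure_number_id,
         a.work_start_date, a.work_stop_date, a.amount
  from   MTA.mta_sow_event a
  join   mta_tenure_event_xref b
    on   a.event_number_id = b.event_number_id
),
cte_with_prev as (   -- hypothetical name
  select d.*,
         -- previous id among rows with identical data, ordered by id
         lag(event_number_id) over (
           partition by tenure_number_id, work_start_date,
                        work_stop_date, amount
           order by event_number_id
         ) as prev_event_number_id   -- hypothetical alias
  from   cte_base_data d
)
select *
from   cte_with_prev
where  event_number_id = prev_event_number_id + 1;
```

Each returned row is the later member of a consecutive pair, with the earlier id in prev_event_number_id. To tolerate gaps, the final predicate could be relaxed to something like `event_number_id - prev_event_number_id <= 5`, where the threshold is an arbitrary example. Note one behavioural difference: PARTITION BY groups NULLs together, whereas the equality join drops any row in which a compared column is NULL.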

User contributions licensed under: CC BY-SA