I have been given this task to try to detect some duplicate records in a table with a large volume of rows. The table comprises 2 joined tables. So to begin with I have:
select b.event_number_id, b.tenure_number_id, a.work_start_date, a.work_stop_date, a.amount from MTA.mta_sow_event a, mta_tenure_event_xref b where a.event_number_id = b.event_number_id
Now we have a table to work from. The duplicate records have unique event_number_id, the reamining fields will contain identical data, so something like this:
| event_number_id | tenure_number_id | work_start_date | work_stop_date |amount| |-----------------|-------------------|-----------------|----------------|------| | 5532733 | 688203 | 01-SEP-14 | 25-SEP-14 | 5000 | | 5532734 | 688203 | 01-SEP-14 | 25-SEP-14 | 5000 |
So, this is an example of a duplicate record. There are consecutive event_number_id’s and all the remaining columns have identical information. We believe that our system has been creating duplicate events for some time now (this isn’t supposed to happen), so I want to query the whole joined table and find anything that has rows that have exactly the same data, but different and consecutive event numbers.
So far, I managed to make a simple query that shows me any rows that have identical information, excluding the event_number_id column:
select b.tenure_number_id, a.work_start_date, a.work_stop_date, a.amount, count(*) from MTA.mta_sow_event a, mta_tenure_event_xref b where a.event_number_id = b.event_number_id group by b.tenure_number_id, a.work_start_date, a.work_stop_date, a.amount having count(*) > 1
which returns:
| tenure_number_id | work_start_date | work_stop_date |amount|Count(*)| |-------------------|-----------------|----------------|------|--------| | 688203 | 01-SEP-14 | 25-SEP-14 | 5000 | 2 |
The problem is, sometimes there are rows that have identical data, but could be valid, so the best we can do at this point is find any of these matching rows that have consecutive event_number_id’s. This is where I am hung up. Is there a way to pull out only the rows that contain these consecutive numbers?
Advertisement
Answer
Here’s an approach based on a join of the data sets:
with cte_base_data as ( select ... your query here ...) select from cte_base_data t1 join cte_base_data t2 on (t1.tenure_number_id = t2.tenure_number_id and t1.work_start_date = t2.work_start_date and t1.work_stop_date = t2.work_stop_date and t1.amount = t2.amount) where t1.event_number_id = t2.event_number_id - 1;
The efficiency will depend on a few factors, such as the efficiency of scanning the base tables and the size of the data sets.
It would be interesting to see a comparison of the execution plans of this method and the analytics-function methods. This common table expression-based join ought to be very efficient as it depends on hash joins, which have almost no cost as long as they stay in memory (a big question mark over that).
I’d be inclined to go for the analytic functions if the event_number_id’s were not consecutive — if there might be gaps, for instance, which would be harder to implement as a join. Given that one of them is the other incremented, I think it’s worth taking a punt on a join.