
How to find duplicate records in MySQL, but with a degree of variance?

Assume I have the following table structure and data:

+------------------+-------------------------+--------+
| transaction_date | transaction_description | amount |
+------------------+-------------------------+--------+
| 2020-08-20       | Burger King             |  10.06 |
| 2020-08-23       | Burger King             |  10.06 |
| 2020-08-29       | McDonalds               |   6.48 |
| 2020-09-04       | Wendy's                 |   7.45 |
| 2020-09-05       | Dairy Queen             |  14.36 |
| 2020-09-06       | Wendy's                 |   7.45 |
| 2020-09-13       | Burger King             |  10.06 |
+------------------+-------------------------+--------+

I’d like to find duplicate transactions where the description and amount match exactly, but the dates are allowed to differ by up to 3 days in either direction.

Because the “Burger King” transactions are within three days of each other (2020-08-20 and 2020-08-23), they would be counted as duplicates, but the entry on 2020-09-13 would not be.

I have the following query so far, but the degree of variance piece is what’s stumping me.

SELECT t.transaction_date, t.transaction_description, t.amount
FROM transactions t
JOIN (SELECT transaction_description, amount, COUNT(*) AS cnt
FROM transactions
GROUP BY transaction_description, amount
HAVING COUNT(*) > 1 ) b
ON t.transaction_description = b.transaction_description
AND t.amount = b.amount
ORDER BY t.amount ASC;

Ideally, I’d love for the output to be something along the lines of:

+------------------+-------------------------+--------+
| transaction_date | transaction_description | amount |
+------------------+-------------------------+--------+
| 2020-08-20       | Burger King             |  10.06 |
| 2020-08-23       | Burger King             |  10.06 |
| 2020-09-04       | Wendy's                 |   7.45 |
| 2020-09-06       | Wendy's                 |   7.45 |
+------------------+-------------------------+--------+

Am I way off? Or is this even possible? Thanks in advance.


Answer

You can use exists:

select t.*
from mytable t
where exists (
    select 1
    from mytable t1
    where
        t1.transaction_description = t.transaction_description
        and t1.amount = t.amount  -- the amount must match as well
        and t1.transaction_date <> t.transaction_date
        and t1.transaction_date >= t.transaction_date - interval 3 day
        and t1.transaction_date <= t.transaction_date + interval 3 day
)
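
To try this against the data from the question, here is a minimal sketch of a test setup. The column types are assumptions, since the question does not show the table definition, and the table is named transactions as in the question's own query (use that name in place of mytable). The final query is a self-join variant, not part of the answer above, that lists each near-duplicate pair once so you can see which rows match each other:

```sql
-- Assumed column types; the real table definition is not shown in the question.
create table transactions (
    transaction_date        date           not null,
    transaction_description varchar(100)   not null,
    amount                  decimal(10, 2) not null
);

insert into transactions (transaction_date, transaction_description, amount) values
    ('2020-08-20', 'Burger King', 10.06),
    ('2020-08-23', 'Burger King', 10.06),
    ('2020-08-29', 'McDonalds',    6.48),
    ('2020-09-04', 'Wendy''s',     7.45),
    ('2020-09-05', 'Dairy Queen', 14.36),
    ('2020-09-06', 'Wendy''s',     7.45),
    ('2020-09-13', 'Burger King', 10.06);

-- Pair each row with any later row that has the same description and amount
-- and a date at most 3 days after it. The strict ">" keeps each pair from
-- being listed twice (and skips rows that share the same date).
select a.transaction_description,
       a.amount,
       a.transaction_date as first_date,
       b.transaction_date as second_date,
       datediff(b.transaction_date, a.transaction_date) as days_apart
from transactions a
join transactions b
  on  b.transaction_description = a.transaction_description
  and b.amount = a.amount
  and b.transaction_date >  a.transaction_date
  and b.transaction_date <= a.transaction_date + interval 3 day
order by a.transaction_description, a.transaction_date;
```

With the sample rows this should return two pairs: Burger King on 2020-08-20 and 2020-08-23 (3 days apart), and Wendy's on 2020-09-04 and 2020-09-06 (2 days apart). The Burger King row from 2020-09-13 has no partner within 3 days, so it does not appear.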

If you are running MySQL 8.0, a count within a window date range is a reasonable alternative:

select t.*
from (
    select t.*,
        count(*) over(
            partition by transaction_description, amount
            order by transaction_date
            range between interval 3 day preceding and interval 3 day following 
        ) cnt
    from mytable t
) t
where cnt > 1
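
The two queries differ in one edge case. The exists version compares dates with <>, so two matching transactions on the same day are not reported as duplicates of each other, while the window version counts them. If same-day duplicates should be caught by the exists approach too, one option is to tell rows apart by a unique key instead of by date. The sketch below assumes a hypothetical unique column named id, which does not appear in the question's table:

```sql
-- Assumes a hypothetical unique column `id` (for example an auto-increment
-- primary key); the table in the question does not show one.
select t.*
from mytable t
where exists (
    select 1
    from mytable t1
    where
        t1.transaction_description = t.transaction_description
        and t1.amount = t.amount
        and t1.id <> t.id  -- a different row, even if the date is identical
        and t1.transaction_date >= t.transaction_date - interval 3 day
        and t1.transaction_date <= t.transaction_date + interval 3 day
);
```

Without such a column, the window query is probably the simpler choice when same-day duplicates matter.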
User contributions licensed under: CC BY-SA