How to find duplicate records in MySQL, but with a degree of variance?

Assume I have the following table structure and data:

+------------------+-------------------------+--------+
| transaction_date | transaction_description | amount |
+------------------+-------------------------+--------+
| 2020-08-20       | Burger King             |  10.06 |
| 2020-08-23       | Burger King             |  10.06 |
| 2020-08-29       | McDonalds               |   6.48 |
| 2020-09-04       | Wendy's                 |   7.45 |
| 2020-09-05       | Dairy Queen             |  14.36 |
| 2020-09-06       | Wendy's                 |   7.45 |
| 2020-09-13       | Burger King             |  10.06 |
+------------------+-------------------------+--------+
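To reproduce the sample data locally, a minimal setup sketch could look like the one below. The column types (`DATE`, `VARCHAR(100)`, `DECIMAL(10,2)`) are assumptions, since the post does not state them. The table is named `transactions` as in the question's query; the answers query `mytable`, so rename one or the other as needed.

```sql
-- Hypothetical setup script: column types are assumed, not taken from the post.
CREATE TABLE transactions (
    transaction_date        DATE          NOT NULL,
    transaction_description VARCHAR(100)  NOT NULL,
    amount                  DECIMAL(10,2) NOT NULL
);

-- Sample rows copied from the table above.
INSERT INTO transactions (transaction_date, transaction_description, amount) VALUES
    ('2020-08-20', 'Burger King', 10.06),
    ('2020-08-23', 'Burger King', 10.06),
    ('2020-08-29', 'McDonalds',    6.48),
    ('2020-09-04', 'Wendy''s',     7.45),
    ('2020-09-05', 'Dairy Queen', 14.36),
    ('2020-09-06', 'Wendy''s',     7.45),
    ('2020-09-13', 'Burger King', 10.06);
```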

 

I’d like to find duplicate transactions where the description and amount match, and the dates fall within 3 days of each other (+/- 3 days).

The first two “Burger King” transactions (2020-08-20 and 2020-08-23) are within three days of each other, so they would count as duplicates, but the entry on 2020-09-13 would not.
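The three-day rule can be checked directly with MySQL's `DATEDIFF`, which returns the difference between two dates in whole days. This is only a small illustration using the dates from the sample data; the column aliases are made up for readability.

```sql
-- DATEDIFF(a, b) returns a - b in whole days.
SELECT DATEDIFF('2020-08-23', '2020-08-20') AS gap_first_pair,   -- 3  (within the window)
       DATEDIFF('2020-09-13', '2020-08-23') AS gap_second_pair;  -- 21 (outside the window)
```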

I have the following query so far, but the degree of variance piece is what’s stumping me.

SELECT t.transaction_date, t.transaction_description, t.amount
FROM transactions t
JOIN (
    -- description/amount pairs that occur more than once, regardless of date
    SELECT transaction_description, amount, COUNT(*) AS cnt
    FROM transactions
    GROUP BY transaction_description, amount
    HAVING COUNT(*) > 1
) b
    ON t.transaction_description = b.transaction_description
    AND t.amount = b.amount
ORDER BY t.amount ASC;

 

Ideally, I’d love for the output to be something along the lines of:

+------------------+-------------------------+--------+
| transaction_date | transaction_description | amount |
+------------------+-------------------------+--------+
| 2020-08-20       | Burger King             |  10.06 |
| 2020-08-23       | Burger King             |  10.06 |
| 2020-09-04       | Wendy's                 |   7.45 |
| 2020-09-06       | Wendy's                 |   7.45 |
+------------------+-------------------------+--------+

 

Am I way off? Or is this even possible? Thanks in advance.

Answer

You can use exists:

select t.*
from mytable t
where exists (
    select 1
    from mytable t1
    where
        t1.transaction_description = t.transaction_description
        and t1.amount = t.amount  -- the question also requires matching amounts
        and t1.transaction_date <> t.transaction_date  -- skip the row itself (also skips same-day repeats)
        and t1.transaction_date >= t.transaction_date - interval 3 day
        and t1.transaction_date <= t.transaction_date + interval 3 day
)

 

If you are running MySQL 8.0 or later, a window count over a date range is a reasonable alternative:

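select t.*
from (
    select t.*,
        -- rows with the same description and amount dated within 3 days
        -- either side of this row; the count includes the row itself
        count(*) over(
            partition by transaction_description, amount
            order by transaction_date
            range between interval 3 day preceding and interval 3 day following
        ) cnt
    from mytable t
) t
where cnt > 1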
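If it helps to see which rows match each other, a self-join variant can list each suspected duplicate pair on one line. This is a sketch rather than part of the answer above: it assumes the same `mytable` name, matches on both description and amount, and the output aliases (`first_date`, `second_date`, `days_apart`) are made up for illustration.

```sql
-- Self-join sketch: one output row per pair of matching transactions
-- dated at most 3 days apart.
select a.transaction_description,
       a.amount,
       a.transaction_date as first_date,
       b.transaction_date as second_date,
       datediff(b.transaction_date, a.transaction_date) as days_apart
from mytable a
join mytable b
    on  b.transaction_description = a.transaction_description
    and b.amount = a.amount
    and b.transaction_date > a.transaction_date                     -- earlier row first, each pair once
    and b.transaction_date <= a.transaction_date + interval 3 day  -- within the 3-day window
order by a.transaction_description, a.transaction_date;
```

On the sample data this should return two rows: the Burger King pair (2020-08-20 and 2020-08-23, 3 days apart) and the Wendy's pair (2020-09-04 and 2020-09-06, 2 days apart). Like the `exists` query, the strict `>` comparison means two identical transactions on the same day are not paired.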

 


Answer