Find Duplicates on Multiple Criteria

I have a set of financial transactions with a Date, Amount, Description, and Source, and I want to find transactions where the amounts are the same, the dates are within one day of each other, but the sources are different. The sources must be different because the transactions are imported from many sources, and each source only contains unique entries.

For example, I would want to find that rows 1 and 3 are duplicates:

'date','amount','description','source'
1/5/2018, 5.28, 'McDonalds', 'BankOfAmerica'
1/6/2018, 8.44, 'Starbucks', 'BankOfAmerica'
1/5/2018, 5.28, 'McDonalds Rest', 'BoA'
2/10/2018, 22.72, 'Chipolte', 'Chase'
3/10/2018, 4.58, 'Wendys', 'BoA'

I tried this in Python, and I can find exact duplicates with:

df_no_dups = df.drop_duplicates(subset=['amount', 'date'])
df_dups = df[~df.isin(df_no_dups)].dropna()  # the rows that were dropped as exact duplicates

but that is an exact date match, and I then have to run another script to make sure the sources are different.

I also tried to groupby amount and then iterate inside each group to find rows where the dates are close and the sources are different, but I couldn't figure out how to work with the groups.
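Here is roughly the direction I was trying to go in (only a sketch; I'm not sure the nested loop over each group is the right way to compare the rows). It assumes the date column is parsed with pd.to_datetime and that "within one day" means the dates differ by at most one calendar day:

import pandas as pd

df = pd.DataFrame({
    'date': ['1/5/2018', '1/6/2018', '1/5/2018', '2/10/2018', '3/10/2018'],
    'amount': [5.28, 8.44, 5.28, 22.72, 4.58],
    'description': ['McDonalds', 'Starbucks', 'McDonalds Rest', 'Chipolte', 'Wendys'],
    'source': ['BankOfAmerica', 'BankOfAmerica', 'BoA', 'Chase', 'BoA'],
})
df['date'] = pd.to_datetime(df['date'])

dup_indices = set()
# rows can only be duplicates if the amounts match, so group on amount
# and compare every pair of rows inside each group
for amount, group in df.groupby('amount'):
    rows = list(group.itertuples())
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            a, b = rows[i], rows[j]
            if abs((a.date - b.date).days) <= 1 and a.source != b.source:
                dup_indices.update([a.Index, b.Index])

df_dups = df.loc[sorted(dup_indices)]  # rows 1 and 3 of the sample data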

Other approaches could be in SQL or in the Google spreadsheet where the transactions live.

Answer

Use exists:

select t1.*
from table_name t1
where exists (
    select 1
    from table_name t2
    where t2.date = t1.date
      and t2.amount = t1.amount
      and t2.source <> t1.source
);
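Note that this matches dates exactly; the question also wants dates within one day of each other, which needs a date-difference condition written for whatever SQL dialect holds the data. For reference, a rough pandas equivalent of the same self-join idea with the one-day window included (assuming df is the DataFrame from the question, with date parsed as a datetime) might look like:

import pandas as pd

# self-join on amount, then keep pairs whose dates are at most one day apart
# and whose sources differ (the same idea as the exists query above)
left = df.reset_index()
pairs = left.merge(left, on='amount', suffixes=('_1', '_2'))
close_dates = (pairs['date_1'] - pairs['date_2']).abs() <= pd.Timedelta(days=1)
different_sources = pairs['source_1'] != pairs['source_2']
df_dups = df.loc[pairs.loc[close_dates & different_sources, 'index_1'].unique()]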