I have a set of financial transactions with a Date, Amount, Description, and Source, and I want to find transactions where the amount is the same, the date is within one day, but the sources are different. The sources should differ because the transactions are imported from many sources, and within a single source every entry is unique.
For example, I would want to find that rows 1 and 3 are duplicates:
'date','amount','description','source'
1/5/2018, 5.28, 'McDonalds', 'BankOfAmerica'
1/6/2018, 8.44, 'Starbucks', 'BankOfAmerica'
1/5/2018, 5.28, 'McDonalds Rest', 'BoA'
2/10/2018, 22.72, 'Chipolte', 'Chase'
3/10/2018, 4.58, 'Wendys', 'BoA'
I tried in Python and I can find duplicates with:
df_no_dups = df.drop_duplicates(subset=['amount', 'date'])
df_dups = df[~df.isin(df_no_dups)].dropna()
but this is an exact date match and then I have to run another script to make sure the sources were different.
I also tried to group by amount and then iterate within each group to find rows where the dates are close and the sources are different, but I couldn't figure out how to work with the groups.
Other approaches could use SQL or the Google Sheets spreadsheet where the transactions live.
Answer
Use exists:

select t1.*
from table_name t1
where exists (
    select 1
    from table_name t2
    where t2.date = t1.date
      and t2.amount = t1.amount
      and t1.source <> t2.source
);
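The query above matches on the exact date; for the one-day window you would compare a date difference in your SQL dialect instead of requiring equality. If the data is in pandas anyway, here is a rough sketch of the same idea with the one-day tolerance, assuming the column names from the sample, that the date column parses as a datetime, and that equal amounts compare exactly as floats:

import pandas as pd
from io import StringIO

# Sample data from the question.
csv = StringIO("""date,amount,description,source
1/5/2018,5.28,McDonalds,BankOfAmerica
1/6/2018,8.44,Starbucks,BankOfAmerica
1/5/2018,5.28,McDonalds Rest,BoA
2/10/2018,22.72,Chipolte,Chase
3/10/2018,4.58,Wendys,BoA""")
df = pd.read_csv(csv, parse_dates=['date'])

# Self-merge on amount, then keep each pair once (index_a < index_b)
# where the dates are at most one day apart and the sources differ.
pairs = df.reset_index().merge(df.reset_index(), on='amount', suffixes=('_a', '_b'))
dups = pairs[
    (pairs['index_a'] < pairs['index_b'])
    & ((pairs['date_a'] - pairs['date_b']).abs() <= pd.Timedelta(days=1))
    & (pairs['source_a'] != pairs['source_b'])
]
print(dups[['index_a', 'index_b', 'amount', 'date_a', 'date_b', 'source_a', 'source_b']])

For the sample data this prints a single pair, rows 0 and 2 (the two McDonalds charges from BankOfAmerica and BoA).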