I have a table that appends data each day and records the import date, but it also appends duplicates.
My end goal is to remove the duplicates, keeping only the row with the lowest Imported date for each client.
Here is the initial state of that table:
TABLE CLIENTS
| Name  | Surname | Imported   |
|-------|---------|------------|
| Bob   | John    | 18-07-2022 |
| Marta | White   | 18-07-2022 |
| Ryan  | Max     | 18-07-2022 |
| Bob   | John    | 20-07-2022 |
| Marta | White   | 20-07-2022 |
| Ryan  | Max     | 20-07-2022 |
| Brian | Red     | 20-07-2022 |
Desired state:
| Name  | Surname | Imported   |
|-------|---------|------------|
| Bob   | John    | 18-07-2022 |
| Marta | White   | 18-07-2022 |
| Ryan  | Max     | 18-07-2022 |
| Brian | Red     | 20-07-2022 |
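For reference, a minimal setup reproducing the initial state, assuming Imported is stored as a DATE (the dates are shown as DD-MM-YYYY above):

```sql
CREATE TABLE clients (
    name     VARCHAR,
    surname  VARCHAR,
    imported DATE
);

INSERT INTO clients (name, surname, imported) VALUES
    ('Bob',   'John',  '2022-07-18'),
    ('Marta', 'White', '2022-07-18'),
    ('Ryan',  'Max',   '2022-07-18'),
    ('Bob',   'John',  '2022-07-20'),
    ('Marta', 'White', '2022-07-20'),
    ('Ryan',  'Max',   '2022-07-20'),
    ('Brian', 'Red',   '2022-07-20');
```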
I figured I could create a view matching all rows with the MAX(Imported) date, and then remove those rows from the original table, although I'm not sure if this is the correct approach.
Once I create the view with the MAX date:
```sql
CREATE VIEW CLIENTS_VIEW AS
SELECT *
FROM CLIENTS
WHERE "Imported" = (SELECT MAX("Imported") FROM CLIENTS);
```
I'm not sure how I could then match the view against the original table and keep only the lowest values per Imported column.
Answer
As mentioned in my comment, ideally each client would have a unique identifier. Lacking that, I am going to use name||'_'||surname as a pseudo primary key.
There are a couple of approaches you could use here.
The first uses a subquery to find each client's earliest import date and then joins on the key and that date:
```sql
CREATE VIEW CLIENTS_VIEW AS
SELECT C.*
FROM CLIENTS C
JOIN (
    SELECT name||'_'||surname AS client_name,
           MIN(imported) AS earliest
    FROM CLIENTS
    GROUP BY 1
) MI
  ON MI.client_name = C.name||'_'||C.surname
 AND MI.earliest = C.imported;
```
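Querying the view should then return the desired state shown in the question, one row per client with the earliest import date:

```sql
SELECT * FROM clients_view ORDER BY imported, name;
```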
Another would be to use a ROW_NUMBER window function, as per the other answer:
```sql
CREATE VIEW CLIENTS_VIEW AS
SELECT C.*
FROM CLIENTS C
QUALIFY ROW_NUMBER() OVER (PARTITION BY name, surname ORDER BY imported ASC) = 1;
```
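Note that QUALIFY is not standard SQL (Snowflake and a handful of other engines support it). As a sketch, the same filter can be written portably with a derived table:

```sql
CREATE VIEW CLIENTS_VIEW AS
SELECT name, surname, imported
FROM (
    SELECT C.*,
           ROW_NUMBER() OVER (PARTITION BY name, surname ORDER BY imported ASC) AS rn
    FROM CLIENTS C
) ranked
WHERE rn = 1;
```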
In my experience, the subquery version is more performant when the amount of data is large.
There are other alternatives too, for example using NOT EXISTS to join back onto the same table, or using a CTE.
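For instance, the NOT EXISTS version keeps a row only when no earlier import exists for the same client (a sketch against the same columns):

```sql
CREATE VIEW CLIENTS_VIEW AS
SELECT C.*
FROM CLIENTS C
WHERE NOT EXISTS (
    SELECT 1
    FROM CLIENTS older
    WHERE older.name = C.name
      AND older.surname = C.surname
      AND older.imported < C.imported
);
```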
The most performant option for larger tables would be to create another table (say clients_earliest) holding the earliest row for each client (again, a unique identifier would be needed) and periodically use MERGE to upsert new data.
Something like this:
```sql
MERGE INTO clients_earliest cl
USING (
    -- deduplicate the source first so the merge never matches
    -- the same target row more than once
    SELECT name, surname, MIN(imported) AS imported
    FROM clients
    GROUP BY name, surname
) c
  ON cl.name||'_'||cl.surname = c.name||'_'||c.surname
WHEN MATCHED AND c.imported < cl.imported THEN
    UPDATE SET cl.imported = c.imported
WHEN NOT MATCHED THEN
    INSERT (name, surname, imported) VALUES (c.name, c.surname, c.imported);
```
If this data changes infrequently, then a semi-regular scheduled task could run this for you. If the table is constantly being appended to, an append-only table stream might be a quicker option, as you would then only be upserting the data added since the last upsert. A sketch of that setup is below.
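A minimal sketch of the stream-plus-task setup in Snowflake, using the clients_earliest table from above; the warehouse name my_wh is an assumption:

```sql
-- Capture only rows newly appended to CLIENTS since the stream was last read
CREATE OR REPLACE STREAM clients_stream
    ON TABLE clients
    APPEND_ONLY = TRUE;

-- Run the upsert hourly; consuming the stream inside the MERGE advances
-- its offset, so each run sees only the rows added since the previous run
CREATE OR REPLACE TASK upsert_clients
    WAREHOUSE = my_wh
    SCHEDULE = '60 MINUTE'
AS
MERGE INTO clients_earliest cl
USING (
    SELECT name, surname, MIN(imported) AS imported
    FROM clients_stream
    GROUP BY name, surname
) c
  ON cl.name||'_'||cl.surname = c.name||'_'||c.surname
WHEN MATCHED AND c.imported < cl.imported THEN
    UPDATE SET cl.imported = c.imported
WHEN NOT MATCHED THEN
    INSERT (name, surname, imported) VALUES (c.name, c.surname, c.imported);

-- Tasks are created suspended; resume to start the schedule
ALTER TASK upsert_clients RESUME;
```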