
Creating a VIEW to remove duplicates from table based on max date

I have a table to which data is appended each day, along with the imported date; however, this creates duplicates.

My end goal is to remove the duplicates, keeping the row with the lowest Imported date for each client.

Here is the initial state of that table:

TABLE CLIENTS

Name   Surname  Imported
-----  -------  ----------
Bob    John     18-07-2022
Marta  White    18-07-2022
Ryan   Max      18-07-2022
Bob    John     20-07-2022
Marta  White    20-07-2022
Ryan   Max      20-07-2022
Brian  Red      20-07-2022

Desired state:

Name   Surname  Imported
-----  -------  ----------
Bob    John     18-07-2022
Marta  White    18-07-2022
Ryan   Max      18-07-2022
Brian  Red      20-07-2022
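
For reference, here is a minimal script to recreate the sample data. This is only a sketch; the DATE column type and the ISO input format are assumptions on my part.

CREATE TABLE CLIENTS (
    name     VARCHAR,
    surname  VARCHAR,
    imported DATE
);

INSERT INTO CLIENTS (name, surname, imported) VALUES
    ('Bob',   'John',  '2022-07-18'),
    ('Marta', 'White', '2022-07-18'),
    ('Ryan',  'Max',   '2022-07-18'),
    ('Bob',   'John',  '2022-07-20'),
    ('Marta', 'White', '2022-07-20'),
    ('Ryan',  'Max',   '2022-07-20'),
    ('Brian', 'Red',   '2022-07-20');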

I figured I could create a view, match it against all rows of the original table, and remove the rows with the MAX(Imported) date? Although I'm not sure if this is the correct approach.

Here is the view with the MAX date:

CREATE VIEW CLIENTS_VIEW AS
SELECT * FROM CLIENTS
WHERE "Imported" = ( SELECT MAX("Imported") FROM CLIENTS);

Although I'm not sure how I could now match this view (which returns the latest import's rows) against the original table and keep only the lowest Imported value per client?


Answer

As mentioned in my comment, ideally each client would have a unique identifier. Lacking that, I am going to use name||'_'||surname as a pseudo primary key.

There are a couple of approaches you could use here.

The first is using a subquery to join on the key and the imported date:

CREATE VIEW CLIENTS_VIEW AS
SELECT C.*
FROM CLIENTS C
JOIN (
    SELECT
        name || '_' || surname AS client_name,
        MAX(imported) AS latest
    FROM CLIENTS
    GROUP BY 1
) MI
    ON MI.client_name = C.name || '_' || C.surname
    AND MI.latest = C.imported;

Another would be to use a row number function, as per the other answer:

CREATE VIEW CLIENTS_VIEW AS
SELECT C.*
FROM CLIENTS C
QUALIFY ROW_NUMBER() OVER (PARTITION BY name, surname ORDER BY imported DESC) = 1;

In my experience, the subquery approach is more performant when the amount of data is large.

There are other alternatives, for example using NOT EXISTS, joining back onto the same table, or using a CTE.
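
For example, the NOT EXISTS variant might look like the following sketch: it keeps a row only when no later import exists for the same client.

CREATE VIEW CLIENTS_VIEW AS
SELECT C.*
FROM CLIENTS C
WHERE NOT EXISTS (
    -- a later import for the same client means this row is not the latest
    SELECT 1
    FROM CLIENTS C2
    WHERE C2.name = C.name
      AND C2.surname = C.surname
      AND C2.imported > C.imported
);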

The most performant option for larger tables would be to create another table with the latest data for each client (again, a unique identifier would be needed) and periodically use MERGE to upsert new data.

Something like this:

MERGE INTO clients_latest cl
USING (SELECT * FROM clients) c
    ON cl.name || '_' || cl.surname = c.name || '_' || c.surname
WHEN MATCHED THEN
    UPDATE SET imported = c.imported
WHEN NOT MATCHED THEN
    INSERT (name, surname, imported) VALUES (c.name, c.surname, c.imported);

If this data changes infrequently, then a semi-regular scheduled task could run this for you. If the table is constantly being appended to, then an append-only table stream might be a quicker option, as you would then only be upserting the data added since the last upsert.
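
As a rough sketch of that setup (the stream name here is made up):

-- Capture only rows appended to CLIENTS since the stream was last consumed
CREATE STREAM clients_stream ON TABLE clients APPEND_ONLY = TRUE;

-- Consuming the stream in a DML statement advances its offset,
-- so each scheduled run upserts only the new rows
MERGE INTO clients_latest cl
USING (SELECT name, surname, imported FROM clients_stream) c
    ON cl.name || '_' || cl.surname = c.name || '_' || c.surname
WHEN MATCHED THEN
    UPDATE SET imported = c.imported
WHEN NOT MATCHED THEN
    INSERT (name, surname, imported) VALUES (c.name, c.surname, c.imported);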
