I have two tables where one is holding “raw” data and another is holding “updated” data. The updated data just contains corrections of rows from the first table, but is essentially the same. It is a functional requirement for this data to be stored separately.
I want a query with the following conditions:
- Select all rows from the first table
- If there is a matching row in the second table (ie. when
raw_d.primary_key_col_1 = edit_d.primary_key_col_1
andraw_d.primary_key_col_2 = edit_d.primary_key_col_2
), we use the most recent (where most recent is based on columnprimary_key_col_3
values from the second table, rather than the first - Otherwise we use the values from the first table.
Note: I have many more “value” columns in the actual data. Considering the following toy example where I have two tables, raw_d
and edit_d
, that are quite similar as follows:
primary_key_col_1 | primary_key_col_2 | value_col_1 | value_col_2 -------------------------+-------------------------+-------------------+------------------- src_1 | dest_1 | 0 | 1 src_2 | dest_2 | 5 | 4 src_3 | dest_3 | 2 | 2 src_4 | dest_4 | 6 | 3 src_5 | dest_5 | 9 | 9 primary_key_col_1 | primary_key_col_2 | primary_key_col_3 | value_col_1 | value_col_2 -------------------------+-------------------------+-------------------------+--------------------------------------- src_1 | dest_1 | 2020-05-09 | 7 | 0 src_2 | dest_2 | 2020-05-08 | 6 | 1 src_3 | dest_3 | 2020-05-07 | 5 | 2 src_1 | dest_1 | 2020-05-08 | 3 | 4 src_2 | dest_2 | 2020-05-09 | 2 | 5
The expected result is as given:
primary_key_col_1 | primary_key_col_2 | value_col_1 | value_col_2 -------------------------+-------------------------+-------------------+------------------- src_1 | dest_1 | 7 | 0 src_2 | dest_2 | 2 | 5 src_3 | dest_3 | 5 | 2 src_4 | dest_4 | 6 | 3 src_5 | dest_5 | 9 | 9
My proposed solution is to query the “greatest n per group” with the second table and then “overwrite” rows in a query of the first table, using Pandas.
The first query would just grab data from the first table:
SELECT * FROM raw_d
The second query to select “the greatest n per group” would be as follows:
SELECT DISTINCT ON (primary_key_col_1, primary_key_col_2) * FROM edit_d ORDER BY primary_key_col_1, primary_key_col_2, primary_key_col_3 DESC;
I planned on merging the data like in Replace column values based on another dataframe python pandas – better way?.
Does anyone know a better solution, preferably using SQL only? For reference, I am using PostgreSQL and Pandas as part of my data stack.
Advertisement
Answer
I would suggest phrasing the requirements as:
- select the most recent row from the second table
- bring in additional rows from the first table that don’t match
This is a union all
with distinct on
:
(select distinct on (primary_key_col_1, primary_key_col_2) u.primary_key_col_1, u.primary_key_col_2, u.value_col_1, u.value_col_2 from updated u order by primary_key_col_1, primary_key_col_2, primary_key_col_3 desc ) union all select r.primary_key_col_1, r.primary_key_col_2, r.value_col_1, r.value_col_2 from raw r where not exists (select 1 from updated u where u.primary_key_col_1 = r.primary_key_col_2 and u.primary_key_col_2 = r.primary_key_col_2 );