I have one table containing my members.
customer_id | name | age |
---|---|---|
1 | John | 74 |
2 | Sarah | 87 |
Everyday, I get a new table containing the current members.
- If a new member has joined, I want to add them.
- If a member has left, I want to nullify their name/id
- If a current member is still a member then I want to keep them as is.
Imagine that I get a new upload with the following rows
customer_id | name | age |
---|---|---|
2 | Sarah | 87 |
3 | Melvin | 23 |
I then want to generate the table
customer_id | name | age |
---|---|---|
Null | Null | 74 |
2 | Sarah | 87 |
3 | Melvin | 23 |
I don’t want to nullify anything by mistake and therefore I want to run a few tests on this table before I replace my old one. The way I’ve done this is by creating a temporary table (let’s call it customer_temp). However, I’ve now created a cyclical dependency since I:
- Need to read the live table customer in order to create the customer_temp
- Need to replace the live table customer with customer_temp after I’ve run my tests
Is there anyway I can do this using dbt?
Advertisement
Answer
Destroying data is tricky. I would avoid that unless it’s necessary (e.g., DSAR compliance).
Assuming the new data is loaded into the same table in your database each day, I think this is a perfect candidate for snapshots, with the option for invalidating hard-deleted records. See the docs. This allows you to capture the change history of a table without any data loss.
If you turned on snapshots in the initial state, your snapshot table would look like (assuming the existing records had a timestamp of 1/1):
customer_id | name | age | valid_from | valid_to |
---|---|---|---|---|
1 | John | 74 | 1/1/2022 | |
2 | Sarah | 87 | 1/1/2022 |
Then, after the source table was updated, re-running dbt snapshot
(today) would create this table:
customer_id | name | age | valid_from | valid_to |
---|---|---|---|---|
1 | John | 74 | 1/1/2022 | 5/12/2022 |
2 | Sarah | 87 | 1/1/2022 | |
3 | Melvin | 23 | 5/12/2022 |
You can create the format you’d like with a simple query:
select case when valid_to is null then customer_id else null end as customer_id, case when valid_to is null then name else null end as name, age from {{ ref('my_snapshot') }}