Skip to content
Advertisement

Query to deduplicate based on one columns

My data looks like this in Teradata

| load_number | id        | time     | slot_time | region | network |
|-------------|-----------|----------|-----------|--------|---------|
| 1692        | 641131146 | 00:20:00 | 00:20:00  | FX-4   | SBOB    |
| 1692        | 641131146 | 00:20:00 | 00:20:30  | FX-4   | SBOB    |
| 1442        | 570732257 | 00:20:00 | 00:20:00  | FX-4   | SBOB    |
| 1442        | 570732257 | 00:20:00 | 00:20:30  | FX-4   | SBOB    |
| 1692        | 641131147 | 00:55:00 | 00:55:00  | FX-4   | SBOB    |
| 1692        | 641131147 | 00:55:00 | 00:55:30  | FX-4   | SBOB    |
| 1442        | 570732258 | 00:55:00 | 00:55:00  | FX-4   | SBOB    |
| 1442        | 570732258 | 00:55:00 | 00:55:30  | FX-4   | SBOB    |
| 14          | 123456    | 00:20:00 | 00:20:00  | ABC    | XYZ     |
| 14          | 123456    | 00:20:00 | 00:20:30  | ABC    | XYZ     |
| 19          | 234567    | 00:20:00 | 00:20:00  | ABC    | XYZ     |
| 19          | 234567    | 00:20:00 | 00:20:30  | ABC    | XYZ     |

How do I construct my sql query to pull the latest loaded data like so:

| load_number | id        | time     | slot_time | region | network |
|-------------|-----------|----------|-----------|--------|---------|
| 1692        | 641131146 | 00:20:00 | 00:20:00  | FX-4   | SBOB    |
| 1692        | 641131146 | 00:20:00 | 00:20:30  | FX-4   | SBOB    |
| 1692        | 641131147 | 00:55:00 | 00:55:00  | FX-4   | SBOB    |
| 1692        | 641131147 | 00:55:00 | 00:55:30  | FX-4   | SBOB    |
| 19          | 234567    | 00:20:00 | 00:20:00  | ABC    | XYZ     |
| 19          | 234567    | 00:20:00 | 00:20:30  | ABC    | XYZ     |

Basically every field will match except the id and the load_number. So given that every field matches except those two fields, can I remove ‘duplicates’ by taking the row with the higher load_number. The data can differ in the region and network columns, and also the load_number can be different as well.

I was thinking of some sort of descending rank() on the load_number or windowing over all columns that match on all fields but the id and load_number, and then taking the highest load_number , Any help is much appreciated!

Advertisement

Answer

If I understand correctly, you can use row_number() and qualify:

select t.*
from t
qualify row_number() over (partition by time, slot_time, region, network order by load_number desc) = 1
User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement