I have the following structure for the table DataTable
: every column is of the datatype int, RowID
is an identity column and the primary key. LinkID
is a foreign key and links to rows of an other table.
RowID LinkID Order Data DataSpecifier 1 120 1 1 1 2 120 2 1 3 3 120 3 1 10 4 120 4 1 13 5 120 5 1 10 6 120 6 1 13 7 371 1 6 2 8 371 2 3 5 9 371 3 8 1 10 371 4 10 1 11 371 5 7 2 12 371 6 3 3 13 371 7 7 2 14 371 8 17 4 ................................. .................................
I’m trying to do a query which alters every LinkID
batch in the following way:
- Take every row with same
LinkID
(e.g. the first batch is the first 6 rows here) - Order them by the
Order
column - Look at
Data
andDataSpecifier
columns as one compare unit (They can be thought as one column, calleddataunit
):- Keep as many rows from
Order
1 onwards, until a duplicatedataunit
comes by - Delete every row from that first duplicate onwards for that
LinkID
- Keep as many rows from
So for the LinkID
120
:
- Sort the batch (already sorted here, but should still do it)
- Start looking from the top (So
Order=1
here), go as long as you don’t see a duplicate. - Stop at the first duplicate
Order = 5
(dataunit
1 10
was already seen). - Delete everything which has the
LinkID=120 AND Order>=5
After similar process for LinkID
371
(and every other LinkID
in the table), the processed table will look like this:
RowID LinkID Order Data DataSpecifier 1 120 1 1 1 2 120 2 1 3 3 120 3 1 10 4 120 4 1 13 7 371 1 6 2 8 371 2 3 5 9 371 3 8 1 10 371 4 10 1 11 371 5 7 2 12 371 6 3 3 ................................. .................................
I’ve done quite a lot of SQL queries, but never something this complicated. I know I need to use a query which is something like this:
DELETE FROM DataTable WHERE RowID IN (SELECT RowID FROM DataTable WHERE -- ? GROUP BY LinkID HAVING COUNT(*) > 1 -- ? ORDER BY [Order]);
But I just can’t seem to wrap my head around this and get the query right. I would preferably do this in pure SQL, with one executable (and reusable) query.
Advertisement
Answer
You can use the ROW_NUMBER()
window function to identify any rows that come after the original. After that you can delete and rows with a matching LinkID
and a greater than or equal to any encountered Order
with a row number greater than one.
(I originally used a second CTE to get the MIN order
, but I realized that it wasn’t necessary as long as the join to order
was greater than equal to any order
where there was a second instance of the DataUnitId. By removing the MIN
the query plan became quite simple and efficient.)
WITH DataUnitInstances AS ( SELECT * , ROW_NUMBER() OVER (PARTITION BY LinkID, [Data], [DataSpecifier] ORDER BY [Order]) DataUnitInstanceId FROM DataTable ) DELETE FROM DataTable FROM DataTable dt INNER JOIN DataUnitInstances dup ON dup.LinkID = dt.LinkID AND dup.[Order] <= dt.[Order] AND dup.DataUnitInstanceId > 1
Here is the output from your sample data which matches your desired result:
+-------+--------+-------+------+---------------+ | RowID | LinkID | Order | Data | DataSpecifier | +-------+--------+-------+------+---------------+ | 1 | 120 | 1 | 1 | 1 | | 2 | 120 | 2 | 1 | 3 | | 3 | 120 | 3 | 1 | 10 | | 4 | 120 | 4 | 1 | 13 | | 7 | 371 | 1 | 6 | 2 | | 8 | 371 | 2 | 3 | 5 | | 9 | 371 | 3 | 8 | 1 | | 10 | 371 | 4 | 10 | 1 | | 11 | 371 | 5 | 7 | 2 | | 12 | 371 | 6 | 3 | 3 | +-------+--------+-------+------+---------------+