I have a table like as shown below
Subject_id subject_name Standard Rank Previous_subject_id 13 ABC 1st 1 21 13 ABC 1st 1 23 13 ABC 1st 1 13 25 def 3rd 6 42 25 def 3rd 6 25 25 def 3rd 6 28 25 XYZ 2nd 7 26 29 PQR 1st 1 31
As you can see that all the columns and values are same except the previous_subject_id
column (for one row).
rule 1
If there are still duplicates after rule 1, I would like to do is drop all who satisfy the condition of subject_id = previous_subject_id
?
rule 2
If there are still duplicate subject_ids, then retain only the first (occurring) record
As you can see in the example output below that I have retained only the first occurring record.
I expect the output to be like as shown below
Subject_id subject_name Standard Rank Previous_subject_id 13 ABC 1st 1 21 25 def 3rd 6 42 25 XYZ 2nd 7 26 29 PQR 1st 1 31
The only problem is my table has 285000 records and isn’t indexed. Once I drop the records, I will be able to set the index to subject_id
as they become unique.
This is what I tried
select * from subject_class a inner join subject_class b on a.subject_id = b.previous_subject_id
Though the above query keeps running for a long time due to index issue, is there any efficient approach?
But how do I drop them?
Can help me with this please?
Advertisement
Answer
I don’t understand why you are using a JOIN
when this seems as simple as:
DELETE FROM subject_class WHERE subject_id = previous_subject_id
?
Also, 285,000 rows is not a large amount, and the performance should be fine. However, 285,000 * 285,000 (81 billion) is a large number, and that’s essentially what your query with a JOIN
is having to work through.
Okay, now we have a problem. In a relational database there’s no concept of “first” or “last”. Rows don’t have any inherent order unless you tell them something to order by. In your example, you have visually picked two rows to keep from the list purely on the basis that when you list them, this is the order they come out in. However, that order is totally non-deterministic. It’s actually likely to be the order that the data was inserted into the heap (non-indexed table) but that’s almost impossible to replicate, and way beyond the scope of this question.
What I can do is to provide a deterministic way to remove rows. Because this is more complex, I’ll set up some test data:
DECLARE @subject_class TABLE ( subject_id INT, subject_name VARCHAR(20), [standard] VARCHAR(20), [rank] INT, previous_subject_id INT); INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 21; INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 23; INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 13; INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 42; INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 25; INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 28;
This is basically your setup, the data you listed in a table with no index.
The first part is simple:
DELETE FROM @subject_class WHERE subject_id = previous_subject_id; --fixes 2 records
The second part is slightly more complex, so I’m using a common-table expression:
WITH cte AS ( SELECT subject_id, MIN(previous_subject_id) AS min_previous_subject_id FROM @subject_class GROUP BY subject_id) DELETE s FROM @subject_class s INNER JOIN cte c ON c.subject_id = s.subject_id AND c.min_previous_subject_id != s.previous_subject_id; SELECT * FROM @subject_class;
That works by first determining the minimum previous_subject_id
for each subject_id
and assuming that this is the only one we want to retain. There’s many other ways to do this, you could pick the highest value, or come up with some, more complex rule.
This doesn’t give you what you asked for, instead you get a result of:
subject_id subject_name standard rank previous_subject_id 13 ABC 1st 1 21 25 def 3rd 6 28
However, this is deterministic, in that you will get the same result every time you run the query.
You wanted the query to only remove rows where there was a match on the “other” fields, so here goes:
DECLARE @subject_class TABLE ( subject_id INT, subject_name VARCHAR(20), [standard] VARCHAR(20), [rank] INT, previous_subject_id INT); INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 21; INSERT INTO @subject_class SELECT 13, 'ABF', '1st', 1, 23; INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 13; INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 42; INSERT INTO @subject_class SELECT 25, 'dez', '3rd', 6, 25; INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 28; DELETE FROM @subject_class WHERE subject_id = previous_subject_id; WITH cte AS ( SELECT subject_id, subject_name, [standard], [rank], MIN(previous_subject_id) AS min_previous_subject_id FROM @subject_class GROUP BY subject_id, subject_name, [standard], [rank]) DELETE s FROM @subject_class s INNER JOIN cte c ON c.subject_id = s.subject_id AND c.subject_name = s.subject_name AND c.[standard] = s.[standard] AND c.[rank] = s.[rank] WHERE c.min_previous_subject_id != s.previous_subject_id; SELECT * FROM @subject_class;
This time we end up with 3 rows: – the row for “dez” is still deleted on the basis that it has the same subject_id and previous_subject_id; – the row for “ABF” is retained, on the basis that it doesn’t match the subject name.
This time with your updated data:
DECLARE @subject_class TABLE ( subject_id INT, subject_name VARCHAR(20), [standard] VARCHAR(20), [rank] INT, previous_subject_id INT); INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 21; INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 23; INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 13; INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 42; INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 25; INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 28; INSERT INTO @subject_class SELECT 25, 'XYZ', '2nd', 7, 26; INSERT INTO @subject_class SELECT 29, 'PQR', '1st', 1, 31; DELETE FROM @subject_class WHERE subject_id = previous_subject_id; WITH cte AS ( SELECT subject_id, subject_name, [standard], [rank], MIN(previous_subject_id) AS min_previous_subject_id FROM @subject_class GROUP BY subject_id, subject_name, [standard], [rank]) DELETE s FROM @subject_class s INNER JOIN cte c ON c.subject_id = s.subject_id AND c.subject_name = s.subject_name AND c.[standard] = s.[standard] AND c.[rank] = s.[rank] WHERE c.min_previous_subject_id != s.previous_subject_id; SELECT * FROM @subject_class;
I get results of:
subject_id subject_name standard rank previous_subject_id 13 ABC 1st 1 21 25 def 3rd 6 28 25 XYZ 2nd 7 26 29 PQR 1st 1 31
Which matches what you expected? Well not quite, but that’s because you are still using the “first” when there’s no such concept. I get the same number of rows, and the results are basically the same. I just pick a different row to keep than you do.