
Drop duplicate records in a non-indexed table using PostgreSQL

I have a table like the one shown below.

As you can see, all the columns and values are the same except for the previous_subject_id column (in one row).

rule 1

If there are still duplicates after rule 1, I would like to drop all records that satisfy the condition subject_id = previous_subject_id.

rule 2

If there are still duplicate subject_ids, then retain only the first (occurring) record.

As you can see in the example output below, I have retained only the first occurring record.

I expect the output to be as shown below.

The only problem is that my table has 285,000 records and isn't indexed. Once I drop the duplicate records, I will be able to create an index on subject_id, as the values will then be unique.
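For example, once the duplicates are removed, something along these lines should work (a sketch; the index name is just illustrative):

-- create the unique index only after the duplicate rows have been deleted
CREATE UNIQUE INDEX subject_class_subject_id_idx ON subject_class (subject_id);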

This is what I tried:

The above query keeps running for a long time because of the missing index. Is there a more efficient approach?

But how do I drop them?

Can you help me with this, please?


Answer

I don’t understand why you are using a JOIN when this seems as simple as:

DELETE FROM subject_class WHERE subject_id = previous_subject_id;

Also, 285,000 rows is not a large amount, and the performance should be fine. However, 285,000 * 285,000 (81 billion) is a large number, and that’s essentially what your query with a JOIN is having to work through.


Okay, now we have a problem. In a relational database there's no concept of "first" or "last". Rows don't have any inherent order unless you specify something to order by. In your example, you have visually picked two rows to keep purely on the basis that, when you list them, this is the order they come out in. However, that order is totally non-deterministic. It's actually likely to be the order in which the data was inserted into the heap (non-indexed table), but that's almost impossible to replicate, and well beyond the scope of this question.

What I can do is to provide a deterministic way to remove rows. Because this is more complex, I’ll set up some test data:
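A minimal version of that setup might look like this (the column names follow the discussion above; the exact types are assumed):

-- heap table with no index, matching the original situation
CREATE TABLE subject_class (
    subject_id          integer,
    subject_name        text,
    previous_subject_id integer
);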

This is basically your setup, the data you listed in a table with no index.

The first part is simple:
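That is essentially the same DELETE as above, run against the test table (a sketch):

-- rule 1: remove rows that point at themselves
DELETE FROM subject_class
WHERE subject_id = previous_subject_id;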

The second part is slightly more complex, so I’m using a common-table expression:
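A sketch of such a CTE-based delete, keeping the row with the lowest previous_subject_id for each subject_id (the CTE name and aliases are arbitrary):

-- rule 2: for each subject_id, retain only the row with the minimum previous_subject_id
WITH keep AS (
    SELECT subject_id,
           MIN(previous_subject_id) AS min_previous_subject_id
    FROM subject_class
    GROUP BY subject_id
)
DELETE FROM subject_class sc
USING keep k
WHERE sc.subject_id = k.subject_id
  AND sc.previous_subject_id <> k.min_previous_subject_id;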

That works by first determining the minimum previous_subject_id for each subject_id and assuming that this is the only one we want to retain. There are many other ways to do this: you could pick the highest value, or come up with some other, more complex rule.

This doesn't give you exactly what you asked for; instead, you get a result of:

However, this is deterministic, in that you will get the same result every time you run the query.


You wanted the query to only remove rows where there was a match on the “other” fields, so here goes:
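A sketch of that variant, which treats rows as duplicates only when subject_name matches as well (the earlier rule-1 delete is unchanged):

-- dedupe only within groups where subject_id and subject_name both match
WITH keep AS (
    SELECT subject_id,
           subject_name,
           MIN(previous_subject_id) AS min_previous_subject_id
    FROM subject_class
    GROUP BY subject_id, subject_name
)
DELETE FROM subject_class sc
USING keep k
WHERE sc.subject_id = k.subject_id
  AND sc.subject_name = k.subject_name
  AND sc.previous_subject_id <> k.min_previous_subject_id;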

This time we end up with 3 rows:
– the row for "dez" is still deleted, on the basis that it has the same subject_id and previous_subject_id;
– the row for "ABF" is retained, on the basis that it doesn't match the subject name.


This time with your updated data:

I get results of:

Does this match what you expected? Not quite, but that's because you are still relying on the notion of "first" when there's no such concept. I get the same number of rows, and the results are basically the same; I just pick a different row to keep than you do.
