Drop duplicate records in a non-indexed table using PostgreSQL

I have a table like the one shown below:

Subject_id  subject_name  Standard  Rank  Previous_subject_id
13          ABC           1st       1     21
13          ABC           1st       1     23
13          ABC           1st       1     13
25          def           3rd       6     42
25          def           3rd       6     25
25          def           3rd       6     28
25          XYZ           2nd       7     26
29          PQR           1st       1     31

As you can see, all the columns and values are the same except the previous_subject_id column (and, for one row in each duplicated group, previous_subject_id is equal to subject_id).

rule 1

Drop all records that satisfy the condition subject_id = previous_subject_id.

rule 2

If there are still duplicate subject_ids after rule 1, retain only the first (occurring) record.

As you can see in the example output below, I have retained only the first occurring record.

I expect the output to look like the table shown below:

Subject_id  subject_name  Standard  Rank  Previous_subject_id
13          ABC           1st       1     21
25          def           3rd       6     42
25          XYZ           2nd       7     26
29          PQR           1st       1     31

The only problem is that my table has 285,000 records and isn’t indexed. Once I drop the duplicate records, I will be able to create an index on subject_id, as the values become unique.
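For reference, this is roughly the index I plan to create once the duplicates are gone (a sketch only; the UNIQUE keyword assumes subject_id really does end up unique after the cleanup):

-- Create the index only after the duplicate rows have been removed.
CREATE UNIQUE INDEX subject_class_subject_id_idx
    ON subject_class (subject_id);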

This is what I tried

select *
from subject_class a
inner join subject_class b
    on a.subject_id = b.previous_subject_id;

The above query keeps running for a long time because there is no index. Is there a more efficient approach?

But how do I drop them?

Can you help me with this, please?


Answer

I don’t understand why you are using a JOIN when this seems as simple as:

DELETE FROM subject_class WHERE subject_id = previous_subject_id;

Also, 285,000 rows is not a large amount, and performance should be fine. However, 285,000 * 285,000 (roughly 81 billion) is a large number, and that’s essentially the number of row combinations your query with a JOIN has to work through.


Okay, now we have a problem. In a relational database there’s no concept of “first” or “last”. Rows don’t have any inherent order unless you give them something to order by. In your example, you have visually picked which rows to keep purely on the basis that, when you list them, this is the order they come out in. However, that order is totally non-deterministic. It’s most likely the order in which the data was inserted into the heap (the non-indexed table), but that’s almost impossible to replicate reliably, and well beyond the scope of this question.
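To illustrate, in PostgreSQL (the dialect of your question) “first” only becomes well defined once there is a column to order by. Assuming a hypothetical inserted_at timestamp column that your table does not actually have:

-- Purely illustrative: number the rows of each subject_id group by a
-- hypothetical insertion timestamp; rn = 1 would then be a well-defined
-- "first" row per subject_id.
SELECT subject_id,
       previous_subject_id,
       ROW_NUMBER() OVER (PARTITION BY subject_id ORDER BY inserted_at) AS rn
FROM subject_class;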

What I can do is provide a deterministic way to remove rows. Because this is more complex, I’ll set up some test data:

DECLARE @subject_class TABLE (
    subject_id INT,
    subject_name VARCHAR(20),
    [standard] VARCHAR(20),
    [rank] INT,
    previous_subject_id INT);
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 21;
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 23;
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 13;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 42;  
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 25;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 28;

This is basically your setup: the data you listed, in a table with no index.
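If you want to run the same experiment in PostgreSQL (which is what the question is tagged with), a temporary table is a reasonable stand-in for the table variable. This is only a sketch, reusing the column names from the question:

-- Temporary table mirroring the test data above; the quoted "standard" and
-- "rank" names simply keep the same column names as the example.
CREATE TEMP TABLE subject_class (
    subject_id INT,
    subject_name VARCHAR(20),
    "standard" VARCHAR(20),
    "rank" INT,
    previous_subject_id INT);
INSERT INTO subject_class VALUES
    (13, 'ABC', '1st', 1, 21),
    (13, 'ABC', '1st', 1, 23),
    (13, 'ABC', '1st', 1, 13),
    (25, 'def', '3rd', 6, 42),
    (25, 'def', '3rd', 6, 25),
    (25, 'def', '3rd', 6, 28);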

The first part is simple:

DELETE FROM @subject_class WHERE subject_id = previous_subject_id; --fixes 2 records

The second part is slightly more complex, so I’m using a common table expression (CTE):

WITH cte AS (
    SELECT
        subject_id,
        MIN(previous_subject_id) AS min_previous_subject_id
    FROM
        @subject_class
    GROUP BY
        subject_id)
DELETE
    s
FROM
    @subject_class s
    INNER JOIN cte c ON c.subject_id = s.subject_id AND c.min_previous_subject_id != s.previous_subject_id;
SELECT * FROM @subject_class;

That works by first determining the minimum previous_subject_id for each subject_id and assuming that this is the only row we want to retain. There are many other ways to do this: you could pick the highest value instead, or come up with some other, more complex rule.

This doesn’t give you exactly what you asked for; instead you get a result of:

subject_id  subject_name    standard    rank    previous_subject_id
13          ABC             1st         1       21
25          def             3rd         6       28

However, this is deterministic, in that you will get the same result every time you run the query.
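Since PostgreSQL has no DELETE ... FROM ... JOIN form, the same idea would be written with DELETE ... USING against your real table. This is a sketch, assuming the table is called subject_class as in your question:

-- Keep only the row with the smallest previous_subject_id per subject_id.
DELETE FROM subject_class AS s
USING (
    SELECT subject_id,
           MIN(previous_subject_id) AS min_previous_subject_id
    FROM subject_class
    GROUP BY subject_id
) AS c
WHERE c.subject_id = s.subject_id
  AND s.previous_subject_id <> c.min_previous_subject_id;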


You wanted the query to only remove rows where there was a match on the “other” fields, so here goes:

DECLARE @subject_class TABLE (
    subject_id INT,
    subject_name VARCHAR(20),
    [standard] VARCHAR(20),
    [rank] INT,
    previous_subject_id INT);
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 21;
INSERT INTO @subject_class SELECT 13, 'ABF', '1st', 1, 23;
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 13;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 42;  
INSERT INTO @subject_class SELECT 25, 'dez', '3rd', 6, 25;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 28;

DELETE FROM @subject_class WHERE subject_id = previous_subject_id;

WITH cte AS (
    SELECT
        subject_id,
        subject_name,
        [standard],
        [rank],
        MIN(previous_subject_id) AS min_previous_subject_id
    FROM
        @subject_class
    GROUP BY
        subject_id,
        subject_name,
        [standard],
        [rank])
DELETE
    s
FROM
    @subject_class s
    INNER JOIN cte c ON c.subject_id = s.subject_id 
        AND c.subject_name = s.subject_name 
        AND c.[standard] = s.[standard]
        AND c.[rank] = s.[rank]
WHERE
    c.min_previous_subject_id != s.previous_subject_id;

SELECT * FROM @subject_class;

This time we end up with 3 rows:

- the row for “dez” is still deleted, on the basis that it has the same subject_id and previous_subject_id;
- the row for “ABF” is retained, on the basis that it doesn’t match on subject_name.
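The same column-matching delete can be translated to PostgreSQL in the same way, again as a sketch against a table named subject_class (the quoted "standard" and "rank" names are carried over from the example):

-- Within each group of rows that match on every other column, keep only
-- the row with the smallest previous_subject_id.
DELETE FROM subject_class AS s
USING (
    SELECT subject_id, subject_name, "standard", "rank",
           MIN(previous_subject_id) AS min_previous_subject_id
    FROM subject_class
    GROUP BY subject_id, subject_name, "standard", "rank"
) AS c
WHERE c.subject_id = s.subject_id
  AND c.subject_name = s.subject_name
  AND c."standard" = s."standard"
  AND c."rank" = s."rank"
  AND s.previous_subject_id <> c.min_previous_subject_id;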


This time with your updated data:

DECLARE @subject_class TABLE (
    subject_id INT,
    subject_name VARCHAR(20),
    [standard] VARCHAR(20),
    [rank] INT,
    previous_subject_id INT);
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 21;
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 23;
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 13;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 42;  
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 25;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 28;
INSERT INTO @subject_class SELECT 25, 'XYZ', '2nd', 7, 26;
INSERT INTO @subject_class SELECT 29, 'PQR', '1st', 1, 31;

DELETE FROM @subject_class WHERE subject_id = previous_subject_id;

WITH cte AS (
    SELECT
        subject_id,
        subject_name,
        [standard],
        [rank],
        MIN(previous_subject_id) AS min_previous_subject_id
    FROM
        @subject_class
    GROUP BY
        subject_id,
        subject_name,
        [standard],
        [rank])
DELETE
    s
FROM
    @subject_class s
    INNER JOIN cte c ON c.subject_id = s.subject_id 
        AND c.subject_name = s.subject_name 
        AND c.[standard] = s.[standard]
        AND c.[rank] = s.[rank]
WHERE
    c.min_previous_subject_id != s.previous_subject_id;

SELECT * FROM @subject_class;

I get results of:

subject_id  subject_name    standard    rank    previous_subject_id
13          ABC             1st         1       21
25          def             3rd         6       28
25          XYZ             2nd         7       26
29          PQR             1st         1       31

Does this match what you expected? Not quite, but that’s because you are still thinking in terms of “first” when there’s no such concept. I get the same number of rows, and the results are basically the same; I just pick a different row to keep than you do.
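If you really do want “keep whichever duplicate happens to be stored first”, one PostgreSQL-specific option is to compare ctid values, which identify the current physical location of each row. This is only a sketch: ctid is not stable across updates or VACUUM FULL, and without an index the self-join is still quadratic, so it is worth creating a plain index on subject_id first.

-- Delete every row for which a "physically earlier" row exists with the
-- same values in all columns other than previous_subject_id.
DELETE FROM subject_class AS a
USING subject_class AS b
WHERE a.ctid > b.ctid
  AND a.subject_id = b.subject_id
  AND a.subject_name = b.subject_name
  AND a."standard" = b."standard"
  AND a."rank" = b."rank";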

User contributions licensed under: CC BY-SA