
Drop duplicate records in a non-indexed table using PostgreSQL

I have a table like the one shown below.

As you can see, all the columns and values are the same except for the previous_subject_id column (in one row).

rule 1

If there are still duplicates after rule 1, I would like to drop all records that satisfy the condition subject_id = previous_subject_id.

rule 2

If there are still duplicate subject_ids, then retain only the first (occurring) record.

As you can see in the example output below, I have retained only the first occurring record.

I expect the output to be as shown below.

The only problem is that my table has 285,000 records and isn't indexed. Once I drop the duplicate records, I will be able to create an index on subject_id, as the values will then be unique.
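For example, once the duplicates are removed, something along these lines should work (a sketch; the index name is just illustrative):

-- create the unique index only after the duplicate rows have been deleted
CREATE UNIQUE INDEX subject_class_subject_id_idx ON subject_class (subject_id);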

This is what I tried:

The above query keeps running for a long time because of the missing index. Is there a more efficient approach?

But how do I drop them?

Can you help me with this, please?


Answer

I don’t understand why you are using a JOIN when this seems as simple as:

DELETE FROM subject_class WHERE subject_id = previous_subject_id;

Also, 285,000 rows is not a large amount, and the performance should be fine. However, 285,000 * 285,000 (81 billion) is a large number, and that’s essentially what your query with a JOIN is having to work through.


Okay, now we have a problem. In a relational database there's no concept of "first" or "last". Rows don't have any inherent order unless you specify something to order by. In your example, you have visually picked two rows to keep purely on the basis that, when you list them, this is the order they come out in. However, that order is totally non-deterministic. It's actually likely to be the order in which the data was inserted into the heap (non-indexed table), but that's almost impossible to replicate, and well beyond the scope of this question.

What I can do is to provide a deterministic way to remove rows. Because this is more complex, I’ll set up some test data:
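A minimal version of that setup might look like this (the column names follow the discussion above; the exact types are assumed):

-- heap table with no index, matching the original situation
CREATE TABLE subject_class (
    subject_id          integer,
    subject_name        text,
    previous_subject_id integer
);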

This is basically your setup, the data you listed in a table with no index.

The first part is simple:
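That is essentially the same DELETE as above, run against the test table (a sketch):

-- rule 1: remove rows that point at themselves
DELETE FROM subject_class
WHERE subject_id = previous_subject_id;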

The second part is slightly more complex, so I’m using a common-table expression:
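A sketch of such a CTE-based delete, keeping the row with the lowest previous_subject_id for each subject_id (the CTE name and aliases are arbitrary):

-- rule 2: for each subject_id, retain only the row with the minimum previous_subject_id
WITH keep AS (
    SELECT subject_id,
           MIN(previous_subject_id) AS min_previous_subject_id
    FROM subject_class
    GROUP BY subject_id
)
DELETE FROM subject_class sc
USING keep k
WHERE sc.subject_id = k.subject_id
  AND sc.previous_subject_id <> k.min_previous_subject_id;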

That works by first determining the minimum previous_subject_id for each subject_id and assuming that this is the only one we want to retain. There are many other ways to do this: you could pick the highest value, or come up with some other, more complex rule.

This doesn't give you exactly what you asked for; instead, you get a result of:

However, this is deterministic, in that you will get the same result every time you run the query.


You wanted the query to only remove rows where there was a match on the “other” fields, so here goes:
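A sketch of that variant, which treats rows as duplicates only when subject_name matches as well (the earlier rule-1 delete is unchanged):

-- dedupe only within groups where subject_id and subject_name both match
WITH keep AS (
    SELECT subject_id,
           subject_name,
           MIN(previous_subject_id) AS min_previous_subject_id
    FROM subject_class
    GROUP BY subject_id, subject_name
)
DELETE FROM subject_class sc
USING keep k
WHERE sc.subject_id = k.subject_id
  AND sc.subject_name = k.subject_name
  AND sc.previous_subject_id <> k.min_previous_subject_id;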

This time we end up with 3 rows:
– the row for "dez" is still deleted, on the basis that it has the same subject_id and previous_subject_id;
– the row for "ABF" is retained, on the basis that it doesn't match the subject name.


This time with your updated data:

I get results of:

Does this match what you expected? Not quite, but that's because you are still relying on the notion of "first" when there's no such concept. I get the same number of rows, and the results are basically the same; I just pick a different row to keep than you do.
