
Very slow (12+ hours) large table joins in postgres

I am struggling to optimize a simple LEFT JOIN between two very large tables; so far it has been running for more than 12 hours and is still going.

Here is the execution plan:

Here is the query:

Both tables are very large: vca and gt have 9 million rows (2 GB) and 1.3 billion rows (346 GB), respectively.

I created vca (a MATERIALIZED VIEW) for the sole purpose of performing this join. Essentially it's a join table containing only the fields required for a 1:1 matching left join, plus some extra metadata. All the fields being joined on are properly indexed, as you can see from the query plan.

The query itself is simple enough; is there something I'm missing that could speed it up? I don't suppose there is some way to use WHERE instead?

Is there something I can tweak in my Postgres settings that might help? Currently I have the following:

UPDATE 12/12:

Table DDL:

More extensive DDL here: https://rextester.com/JRJH43442

UPDATE 12/13:

To clarify: I am using Postgres 10.5 on CentOS 7.3 with 16 cores and 32 GB of memory. The query has now been running for 24+ hours without returning any result.

Checking on its status, it seems the wait_event_type is IO. Does this mean the query is reading/writing to scratch space (i.e. spilling to disk)? Could this explain the slowness?
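For reference, this kind of status check can be done against pg_stat_activity, which exposes wait_event_type and wait_event (a minimal sketch; the WHERE filter is just one way to narrow it down):

```sql
-- Show what each active backend is currently waiting on.
SELECT pid, state, wait_event_type, wait_event,
       now() - query_start AS runtime,
       left(query, 60)     AS query_head
FROM pg_stat_activity
WHERE state = 'active';
```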

I have a lot of available resources:

I guess making more memory available could help? Is there some way to optimize queries that need more memory than is available to them?
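If the join is hashing or sorting on disk, one relevant setting is work_mem, which can be raised for a single session. A hedged sketch; the value below is purely illustrative, not a recommendation for this 32 GB machine:

```sql
-- Give this session more memory per sort/hash node before running the big join.
SET work_mem = '1GB';   -- illustrative value only
-- ... run the join here ...
RESET work_mem;         -- back to the server default
```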


Answer

From this post’s comment:

Your query is using genotype_pos_ind and filtering on aliquot_barcode. Try deleting (temporarily) genotype_pos_ind, and if that doesn't work, search for how to force index usage.

Your query should be using genotype_pk instead.

From what you said, there might be a lot of records with the same values for aliquot_barcode, chrom, start and end, so the RDBMS will take a long time to filter every aliquot_barcode.
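Since DDL is transactional in Postgres, the "delete the index temporarily" experiment from the comment can be done safely inside a transaction and rolled back (a sketch; substitute your actual query, and note that DROP INDEX holds an exclusive lock on the table while the transaction is open):

```sql
BEGIN;
DROP INDEX genotype_pos_ind;  -- undone by the ROLLBACK below
EXPLAIN                       -- plan only; add ANALYZE to actually run it
SELECT ...;                   -- your original join goes here
ROLLBACK;                     -- restores genotype_pos_ind
```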

And if it’s still too long for you, you can try my older answer, which I’ll keep for further references:



Unfortunately, I won't be able to fully optimize your query: there are too many things to take into account. Building a result with 9 million records of 13 fields might be too much: swapping might occur, your OS won't allow that much memory allocation while also performing the JOIN, etc. (written before the real answer above…)

I used to optimize queries consisting of fifteen tables of around 10 million records each. A SELECT of this size would never be doable in a reasonable time (less than 10 hours).

I don't have any RDBMS at hand to test what I'm saying, and I haven't done any SQL for half a year :p Finding out why this is taking so much time (as you asked) would be too time-consuming, so here is another solution to the original problem.


The solution I adopted was to build a temporary table (a combined SQL sketch follows the list):

  1. Create the temporary table tmp_analysis, with the same fields as your SELECT plus some utility fields:

An ID field (tmp_ID, a bigint), a boolean to check whether the record has been updated (tmp_updated), and a timestamp recording when it was updated (tmp_update_time). And of course all the fields, with the same datatypes, from your original SELECT (from both vca and gt).

  2. Insert all your records from vca:

For the moment, use NULL (or any other default value if you can't) for the fields from gt. Set tmp_updated to false. Use a simple counter (e.g. row_number()) for the primary key.

  3. Update all these records with fields from gt.

Use a WHERE rather than a JOIN:

I said that you should use EXISTS for performance reasons, but I was mistaken, as I don't think you can retrieve fields from inside the EXISTS condition. There might be a way to tell PostgreSQL that it's a one-to-one relationship, but I'm not sure. Anyway, index the fields used in the WHERE clause.

  4. Obviously, SELECT from your tmp_analysis table to get your records!
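Here is a combined sketch of those four steps. Only aliquot_barcode, chrom, start and end are taken from the post; gt_field1 (and its NULL::text type) is a hypothetical stand-in for whatever gt columns your SELECT actually needs:

```sql
-- Steps 1+2 (see note 5 below): build the temp table straight from vca,
-- casting NULL so the gt columns get a concrete datatype.
CREATE TEMPORARY TABLE tmp_analysis AS
SELECT
    row_number() OVER ()  AS tmp_ID,           -- utility: segmenting key
    false                 AS tmp_updated,      -- utility: has this row been updated?
    NULL::timestamptz     AS tmp_update_time,  -- utility: when was it updated?
    vca.aliquot_barcode,
    vca.chrom,
    vca."start",                               -- "end" is reserved; quoting both is safest
    vca."end",
    NULL::text            AS gt_field1         -- hypothetical gt column, typed via cast
FROM vca;

CREATE INDEX ON tmp_analysis (tmp_ID);

-- Step 3: fill in the gt fields with UPDATE ... FROM (a WHERE rather than a JOIN).
UPDATE tmp_analysis t
SET gt_field1       = gt.gt_field1,
    tmp_updated     = true,
    tmp_update_time = now()
FROM gt
WHERE gt.aliquot_barcode = t.aliquot_barcode
  AND gt.chrom           = t.chrom
  AND gt."start"         = t."start"
  AND gt."end"           = t."end";

-- Step 4: read back the result.
SELECT * FROM tmp_analysis;
```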

Some notes for this:

  1. If it’s taking too much time:

Use the tmp_ID field to limit the number of updates, to 10,000 for example, and check the execution plan of the 3rd query (the UPDATE): you should see a full scan on the temporary table and an index scan on gt (on genotype_pk). If not, check your indexes and search for how to force index use in PostgreSQL. Use WHERE tmp_ID < 10000 rather than LIMIT 10000; IIRC, LIMIT will execute the whole query and just give you part of the result.

  2. If it's still taking too much time:

Segment the query using tmp_ID and (as you said) use a loop statement around the UPDATE to process 100,000 or fewer records at a time (again, use WHERE tmp_ID < x AND tmp_ID > y). Check the execution plan again: the full scan should be limited by tmp_ID before the index scan. Don't forget to add an index on this field (if it's not already the primary key). A loop sketch follows these notes.

  3. If you need to call this again later:

Use BEGIN/END TRANSACTION to encapsulate all the queries, and the TEMPORARY option on CREATE TABLE tmp_analysis, so that you won't have to clean up tmp_analysis after executing the query.

  4. If you still have a problem with loops:

Use transactions inside the loop, and stop it if it freezes again. Then you can resume it later with a smaller loop size.

  5. If you want to reduce the execution time a little:

You can do steps 1 and 2 in one query with CREATE TABLE … AS SELECT. To set the datatype for the fields from gt, which start out as NULL, cast them (e.g. NULL::integer). Normally, this should be a little faster as a whole.

  6. If you're curious:

If the query without the loop still takes more than 10 hours, stop it and check tmp_update_time to see how the execution times evolve; maybe it'll give you a clue about why the original query didn't work. There are multiple configuration options in PostgreSQL that limit RAM usage, disk usage, and threads. Your OS might impose its own limits as well; also check disk swapping, CPU cache usage, etc. (I think you've already done some of this, but I didn't check.)
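And a hedged sketch of the segmented UPDATE from notes 1, 2 and 4, as a PL/pgSQL loop (same placeholder columns as the step sketch above). Note that on Postgres 10 a DO block runs as a single transaction, so for the per-batch transactions of note 4 you would drive this loop from client code instead:

```sql
DO $$
DECLARE
    batch_size bigint := 100000;  -- tune down if batches still freeze
    lo         bigint := 0;
    max_id     bigint;
BEGIN
    SELECT max(tmp_ID) INTO max_id FROM tmp_analysis;
    WHILE lo <= max_id LOOP
        UPDATE tmp_analysis t
        SET gt_field1       = gt.gt_field1,
            tmp_updated     = true,
            tmp_update_time = now()
        FROM gt
        WHERE gt.aliquot_barcode = t.aliquot_barcode
          AND gt.chrom           = t.chrom
          AND gt."start"         = t."start"
          AND gt."end"           = t."end"
          AND t.tmp_ID >= lo AND t.tmp_ID < lo + batch_size;  -- segment, not LIMIT
        RAISE NOTICE 'updated tmp_ID range % .. %', lo, lo + batch_size - 1;
        lo := lo + batch_size;
    END LOOP;
END $$;
```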
