I am struggling to optimize a simple LEFT JOIN against two very large tables; so far it has been running for more than 12 hours and is still going.
Here is the execution plan:
Gather  (cost=1001.26..11864143.06 rows=8972234 width=133)
  Workers Planned: 7
  ->  Nested Loop Left Join  (cost=1.26..10773657.51 rows=1281748 width=133)
        ->  Parallel Index Scan using var_case_aliquot_aliquot_ind on var_case_aliquot vca  (cost=0.56..464070.21 rows=1281748 width=103)
        ->  Index Scan using genotype_pos_ind on snv_genotypes gt  (cost=0.70..8.01 rows=1 width=65)
              Index Cond: ((vca.chrom = chrom) AND (vca.start = start) AND (vca.end = end) AND ((vca.alt)::text = (alt)::text))
              Filter: (vca.aliquot_barcode = aliquot_barcode)
Here is the query:
SELECT vca.aliquot_barcode,
       vca.case_barcode,
       vca.gene_symbol,
       vca.variant_classification,
       vca.variant_type,
       vca.chrom,
       int4range(vca.start::integer, vca."end"::integer, '[]'::text) AS pos,
       vca.alt,
       gt.called AS mutect2_call,
       gt.ref_count,
       gt.alt_count,
       gt.read_depth,
       gt.called OR
           CASE WHEN (gt.alt_count + gt.ref_count) > 0
                THEN (gt.alt_count::numeric / (gt.alt_count + gt.ref_count)::numeric) > 0.20
                ELSE false
           END AS vaf_corrected_call
FROM analysis.var_case_aliquot vca
LEFT JOIN analysis.snv_genotypes gt
       ON vca.aliquot_barcode = gt.aliquot_barcode
      AND vca.chrom = gt.chrom
      AND vca.start = gt.start
      AND vca."end" = gt."end"
      AND vca.alt::text = gt.alt::text
Both tables are very large: vca and gt have 9 million rows (2 GB) and 1.3 billion rows (346 GB), respectively.
I created the vca MATERIALIZED VIEW for the sole purpose of performing this join. Essentially it's a join table with only the required fields for a 1:1 matching left join and then some extra metadata. All the fields being joined on are properly indexed, as you can see from the query plan.
The query itself is simple enough; is there something I'm missing that could speed it up? I don't suppose there is some way to use WHERE instead?
Is there something I can tweak in my postgres settings that might help? Currently I have the following:
shared_buffers = 4096MB
effective_cache_size = 20GB
work_mem = 64MB
maintenance_work_mem = 4096MB
max_wal_size = 4GB
min_wal_size = 128MB
checkpoint_completion_target = 0.9
max_worker_processes = 16
max_parallel_workers_per_gather = 8
max_parallel_workers = 16
UPDATE 12/12:
Table DDL:
CREATE TABLE analysis.snv_genotypes (
    aliquot_barcode character(30) NOT NULL,
    chrom character(2) NOT NULL,
    start bigint NOT NULL,
    "end" bigint NOT NULL,
    alt character varying(510) NOT NULL,
    genotype character(3),
    read_depth integer,
    ref_count integer,
    alt_count integer,
    called boolean
);

ALTER TABLE ONLY analysis.snv_genotypes
    ADD CONSTRAINT genotype_pk PRIMARY KEY (aliquot_barcode, chrom, start, "end", alt);

CREATE INDEX called_ind ON analysis.snv_genotypes USING btree (called);
CREATE INDEX genotype_pos_ind ON analysis.snv_genotypes USING btree (chrom, start, "end", alt);

CREATE MATERIALIZED VIEW analysis.var_case_aliquot AS
SELECT var_case_aliquot.aliquot_barcode,
       var_case_aliquot.case_barcode,
       var_case_aliquot.chrom,
       var_case_aliquot.start,
       var_case_aliquot."end",
       var_case_aliquot.alt,
       var_case_aliquot.gene_symbol,
       var_case_aliquot.variant_classification,
       var_case_aliquot.variant_type,
       var_case_aliquot.hgvs_p,
       var_case_aliquot.polyphen,
       var_case_aliquot.sift
FROM var_case_aliquot
WITH NO DATA;

CREATE INDEX var_case_aliquot_aliquot_ind ON analysis.var_case_aliquot USING btree (aliquot_barcode);
CREATE INDEX var_case_aliquot_pos_ind ON analysis.var_case_aliquot USING btree (chrom, start, "end", alt);
More extensive DDL here: https://rextester.com/JRJH43442
UPDATE 12/13:
To clarify, I am using Postgres 10.5 on CentOS 7.3 with 16 cores and 32 GB of memory. The query has now been running for 24+ hours without producing any result.
Checking on its status, it seems that wait_event_type is IO. Does this mean the query is reading from/writing to scratch space? Could this explain the slowness?
+------------------+---------------+---------------+---------------+---------------+-----------------+--------------+--------+-------------+--------------+
| application_name | backend_start | xact_start    | query_start   | state_change  | wait_event_type | wait_event   | state  | backend_xid | backend_xmin |
+------------------+---------------+---------------+---------------+---------------+-----------------+--------------+--------+-------------+--------------+
| psql             | 12/12/18 8:42 | 12/12/18 8:42 | 12/12/18 8:42 | 12/12/18 8:42 | IO              | DataFileRead | active | 22135       | 22135        |
+------------------+---------------+---------------+---------------+---------------+-----------------+--------------+--------+-------------+--------------+
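(The status above comes from pg_stat_activity; roughly a query along these lines:)

-- The columns shown are all standard pg_stat_activity columns in PostgreSQL 10.
SELECT application_name, backend_start, xact_start, query_start, state_change,
       wait_event_type, wait_event, state, backend_xid, backend_xmin
FROM pg_stat_activity
WHERE state = 'active';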
I have a lot of available resources:
$ free -h
              total        used        free      shared  buff/cache   available
Mem:            31G        722M        210M        5.0G         30G         25G
Swap:          3.7G        626M        3.1G
I guess making more memory available could help? Is there some way to optimize queries that need more memory than available to them?
Answer
From this post’s comment:
Your query is using genotype_pos_ind and filtering on aliquot_barcode. Try deleting (temporarily) genotype_pos_ind, and if that doesn't work, search how to force index usage.
Your query should be using genotype_pk instead.
From what you said, there might be a lot of records with the same values for aliquot_barcode, chrom, start and end, so the RDBMS will then take a long time to filter every aliquot_barcode.
And if it's still too long for you, you can try my older answer, which I'll keep for further reference:
Unfortunately, I won't be able to optimize your query: there are too many things to take into account. Building a result with 9 million records of 13 fields might be too much: swapping might occur, your OS won't allow so much memory allocation while also performing the JOIN, etc. (written before the real answer above).
I used to optimize queries involving fifteen tables of around 10 million records each. A SELECT of this size would never be doable in reasonable time (less than 10 hours).
I don't have any RDBMS at hand to test what I'm saying, and I haven't done any SQL for half a year :p. Finding out why this is taking so much time (as you asked) would be too time consuming, so here is another solution to the original problem.
The solution I adopted was making a temporary table:
- Create the temporary table tmp_analysis, with the same fields as your SELECT plus some utility fields:
an ID field (tmp_ID, a bigint), a boolean to check whether the record has been updated (tmp_updated), and a timestamp to check when it was updated (tmp_update_time);
and of course all the fields, with the same datatypes, from your original SELECT (from vca and gt). A sketch follows below.
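A minimal sketch of that table; the types for the vca metadata columns (case_barcode, gene_symbol, variant_*) are not in the DDL shown, so they are assumptions to adjust:

-- Sketch of the scratch table: utility fields first, then the columns used by
-- the original SELECT.
CREATE TEMPORARY TABLE tmp_analysis (
    tmp_ID                  bigint,
    tmp_updated             boolean DEFAULT false,
    tmp_update_time         timestamp with time zone,
    aliquot_barcode         character(30),
    case_barcode            text,                    -- assumed type
    gene_symbol             text,                    -- assumed type
    variant_classification  text,                    -- assumed type
    variant_type            text,                    -- assumed type
    chrom                   character(2),
    start                   bigint,
    "end"                   bigint,
    alt                     character varying(510),
    mutect2_call            boolean,                 -- gt.called
    ref_count               integer,
    alt_count               integer,
    read_depth              integer,
    vaf_corrected_call      boolean
);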
- Insert all your records from vca:
use NULL (or any other default value if you can't) for the fields from gt for the moment, set tmp_updated to false, and use a simple counter (row_number() will do) for the primary key. See the sketch below.
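A sketch of that insert, assuming the tmp_analysis layout above and using row_number() as the simple counter:

-- Sketch: seed tmp_analysis from vca, leaving the gt-derived columns NULL for now.
INSERT INTO tmp_analysis (tmp_ID, tmp_updated, aliquot_barcode, case_barcode,
                          gene_symbol, variant_classification, variant_type,
                          chrom, start, "end", alt)
SELECT row_number() OVER (),   -- simple counter used as the surrogate key
       false,                  -- nothing updated yet
       vca.aliquot_barcode, vca.case_barcode,
       vca.gene_symbol, vca.variant_classification, vca.variant_type,
       vca.chrom, vca.start, vca."end", vca.alt
FROM analysis.var_case_aliquot vca;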
- Update all these records with fields from gt. Use a WHERE rather than a JOIN:
UPDATE tmp_analysis AS tmp  -- I don't think you need a schema qualifier for tmp_analysis
SET tmp_updated = true,
    tmp_update_time = clock_timestamp(),
    mutect2_call = gt.called,
    ref_count = gt.ref_count,
    alt_count = gt.alt_count,
    read_depth = gt.read_depth,
    vaf_corrected_call = gt.called OR
        CASE WHEN (gt.alt_count + gt.ref_count) > 0
             THEN (gt.alt_count::numeric / (gt.alt_count + gt.ref_count)::numeric) > 0.20
             ELSE false
        END
FROM analysis.snv_genotypes gt
WHERE -- a JOIN should work too
      tmp.aliquot_barcode = gt.aliquot_barcode
  AND tmp.chrom = gt.chrom
  AND tmp.start = gt.start
  AND tmp."end" = gt."end"
  AND tmp.alt::text = gt.alt::text;
I said that you should use EXISTS for performance reasons, but I was mistaken, as I don't think you can retrieve fields from inside the EXISTS condition. There might be a way to tell PostgreSQL that it's a one-to-one relationship, but I'm not sure. Anyway, index
- Obviously, SELECT from your tmp_analysis table to get your records! (See the sketch below.)
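A sketch of that final read, rebuilding the pos range the same way the original query does:

-- Sketch: pull the results out of the scratch table; int4range reproduces the
-- pos column from the original query.
SELECT aliquot_barcode, case_barcode, gene_symbol, variant_classification,
       variant_type, chrom,
       int4range(start::integer, "end"::integer, '[]'::text) AS pos,
       alt, mutect2_call, ref_count, alt_count, read_depth, vaf_corrected_call
FROM tmp_analysis;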
Some notes for this:
- If it's taking too much time:
Use the tmp_ID field to limit the number of updates to 10 000, for example, and check the execution plan of the 3rd query (the UPDATE): you should have a full scan on the temporary table and an index scan on gt (on genotype_pk). If not, check your indexes and search how to force index use by PGSQL. You should use WHERE tmp_ID < 10000 rather than LIMIT 10000; IIRC, LIMIT will execute the whole query and just give you part of the result. (See the sketch below.)
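A sketch of that check; 10000 is just the example cutoff, and only a couple of the SET columns are shown:

-- Sketch: check the plan on a small slice. EXPLAIN ANALYZE really executes the
-- UPDATE, so wrap it in a transaction and roll back if you only want the plan.
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE tmp_analysis AS tmp
SET tmp_updated = true,
    tmp_update_time = clock_timestamp()
FROM analysis.snv_genotypes gt
WHERE tmp.tmp_ID < 10000
  AND tmp.aliquot_barcode = gt.aliquot_barcode
  AND tmp.chrom = gt.chrom
  AND tmp.start = gt.start
  AND tmp."end" = gt."end"
  AND tmp.alt::text = gt.alt::text;
ROLLBACK;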
- If it's still taking too much time:
Segment the query using tmp_ID and (as you said) use a loop statement on the UPDATE to query 100 000 or fewer records at once (again, use WHERE tmp_ID < x AND tmp_ID > y). Check the execution plan again: the full scan should be limited by the tmp_ID condition before the index scan. Don't forget to add an index on this field (if it's not already the primary key). A loop sketch follows below.
- If you need to call this again later:
Use BEGIN/END TRANSACTION to encapsulate all the queries, and the TEMPORARY TABLE option on CREATE TABLE tmp_analysis so that you won't have to clean up tmp_analysis after executing the query. (A sketch follows below.)
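A sketch of that wrapping; ON COMMIT DROP is one option (my addition, not in the original text) that makes the temporary table disappear by itself when the transaction ends:

-- Sketch: run the whole pipeline in one transaction; the temporary table
-- vanishes at COMMIT.
BEGIN;
CREATE TEMPORARY TABLE tmp_analysis (
    tmp_ID          bigint,
    tmp_updated     boolean DEFAULT false,
    tmp_update_time timestamp with time zone
    -- ... plus the vca/gt columns from the sketch in step 1 ...
) ON COMMIT DROP;
-- the INSERT, UPDATE and final SELECT from the steps above go here
COMMIT;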
- If you still have a problem with loops:
Use transactions inside the loop, and stop it if it freezes again. Then you can resume it later with a smaller loop size.
- If you want to reduce the execution time a little:
You can do steps 1 and 2 in one query with an INSERT ... SELECT, but I don't remember how to set the datatype for the fields coming from gt, as they'll be set to NULL. Normally this should be a little bit faster as a whole. (See the sketch below.)
- If you're curious:
If the query without the loop still takes more than 10 hours, stop it and check tmp_update_time to see how the execution time evolves; maybe it'll give you a clue about why the original query didn't work. There are multiple configuration options in PGSQL to limit RAM usage, disk usage, and threads, and your OS might add its own limits. Also check disk swapping, CPU cache usage, etc. (I think you've already done some of this, but I didn't check.)