Very slow (12+ hours) large table joins in postgres

I am struggling to optimize a simple LEFT JOIN against two very large tables that so far has been taking > 12 hours to complete and ongoing.

Here is the execution plan:

Gather  (cost=1001.26..11864143.06 rows=8972234 width=133)
  Workers Planned: 7
  ->  Nested Loop Left Join  (cost=1.26..10773657.51 rows=1281748 width=133)
        ->  Parallel Index Scan using var_case_aliquot_aliquot_ind on var_case_aliquot vca  (cost=0.56..464070.21 rows=1281748 width=103)
        ->  Index Scan using genotype_pos_ind on snv_genotypes gt  (cost=0.70..8.01 rows=1 width=65)
              Index Cond: ((vca.chrom = chrom) AND (vca.start = start) AND (vca.end = end) AND ((vca.alt)::text = (alt)::text))
              Filter: (vca.aliquot_barcode = aliquot_barcode)

Here is the query:

SELECT vca.aliquot_barcode,
    vca.case_barcode,
    vca.gene_symbol,
    vca.variant_classification,
    vca.variant_type,
    vca.chrom,
    int4range(vca.start::integer, vca."end"::integer, '[]'::text) AS pos,
    vca.alt,
    gt.called AS mutect2_call,
    gt.ref_count,
    gt.alt_count,
    gt.read_depth,
    gt.called OR
        CASE
            WHEN (gt.alt_count + gt.ref_count) > 0 THEN (gt.alt_count::numeric / (gt.alt_count + gt.ref_count)::numeric) > 0.20
            ELSE false
        END AS vaf_corrected_call
   FROM analysis.var_case_aliquot vca
     LEFT JOIN analysis.snv_genotypes gt ON vca.aliquot_barcode = gt.aliquot_barcode AND vca.chrom = gt.chrom AND vca.start = gt.start AND vca."end" = gt."end" AND vca.alt::text = gt.alt::text

Both tables are very large: vca and gt have 9 million (2 GB) and 1.3 billion rows (346 GB), respectively.

I created vca (a MATERIALIZED VIEW) for the sole purpose of performing this join. Essentially it’s a join table with only the fields required for a 1:1 matching left join, plus some extra metadata. As you can see from the query plan, all the fields being joined on are properly indexed.

The query itself is simple enough, is there something I’m missing that could speed it up? I don’t suppose there is some way to use WHERE instead?

Is there something I can tweak in my postgres settings that might help? Currently I have the following:

shared_buffers = 4096MB
effective_cache_size = 20GB
work_mem = 64MB
maintenance_work_mem = 4096MB
max_wal_size = 4GB
min_wal_size = 128MB
checkpoint_completion_target = 0.9
max_worker_processes = 16
max_parallel_workers_per_gather = 8
max_parallel_workers = 16

UPDATE 12/12:

Table DDL:

CREATE TABLE analysis.snv_genotypes (
    aliquot_barcode character(30) NOT NULL,
    chrom character(2) NOT NULL,
    start bigint NOT NULL,
    "end" bigint NOT NULL,
    alt character varying(510) NOT NULL,
    genotype character(3),
    read_depth integer,
    ref_count integer,
    alt_count integer,
    called boolean
);

ALTER TABLE ONLY analysis.snv_genotypes
    ADD CONSTRAINT genotype_pk PRIMARY KEY (aliquot_barcode, chrom, start, "end", alt);
CREATE INDEX called_ind ON analysis.snv_genotypes USING btree (called);
CREATE INDEX genotype_pos_ind ON analysis.snv_genotypes USING btree (chrom, start, "end", alt);

CREATE MATERIALIZED VIEW analysis.var_case_aliquot AS
 SELECT var_case_aliquot.aliquot_barcode,
    var_case_aliquot.case_barcode,
    var_case_aliquot.chrom,
    var_case_aliquot.start,
    var_case_aliquot."end",
    var_case_aliquot.alt,
    var_case_aliquot.gene_symbol,
    var_case_aliquot.variant_classification,
    var_case_aliquot.variant_type,
    var_case_aliquot.hgvs_p,
    var_case_aliquot.polyphen,
    var_case_aliquot.sift
   FROM var_case_aliquot
  WITH NO DATA;

CREATE INDEX var_case_aliquot_aliquot_ind ON analysis.var_case_aliquot USING btree (aliquot_barcode);
CREATE INDEX var_case_aliquot_pos_ind ON analysis.var_case_aliquot USING btree (chrom, start, "end", alt);

More extensive DDL here: https://rextester.com/JRJH43442

UPDATE 12/13:

To clarify, I am using Postgres 10.5 on CentOS 7.3 w/ 16 cores and 32 GB mem. The query has now been running for 24+ hrs without any result.

Checking on the status it seems that wait_event_type is IO. Does this mean the query is scratching/writing to scratch space? Could this explain the slowness?

+------------------+---------------+---------------+---------------+---------------+-----------------+--------------+--------+-------------+--------------+
| application_name | backend_start | xact_start    | query_start   | state_change  | wait_event_type | wait_event   | state  | backend_xid | backend_xmin |
+------------------+---------------+---------------+---------------+---------------+-----------------+--------------+--------+-------------+--------------+
| psql             | 12/12/18 8:42 | 12/12/18 8:42 | 12/12/18 8:42 | 12/12/18 8:42 | IO              | DataFileRead | active | 22135       | 22135        |
+------------------+---------------+---------------+---------------+---------------+-----------------+--------------+--------+-------------+--------------+
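
(The status above is from pg_stat_activity; a query along these lines, with the column list trimmed to what is shown, reproduces it:)

SELECT application_name, backend_start, xact_start, query_start,
       state_change, wait_event_type, wait_event, state,
       backend_xid, backend_xmin
  FROM pg_stat_activity
 WHERE state = 'active';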

I have a lot of available resources:

$ free -h
              total        used        free      shared  buff/cache   available
Mem:            31G        722M        210M        5.0G         30G         25G
Swap:          3.7G        626M        3.1G

I guess making more memory available could help? Is there some way to optimize queries that need more memory than available to them?


Answer

From this post’s comment:

Your query is using genotype_pos_ind and filtering on aliquot_barcode. Try deleting (temporarily) genotype_pos_ind and, if that doesn’t work, search how to force index usage.

Your query should be using genotype_pk instead.

From what you said, there might be a lot of records with the same values for chrom, start and "end", so the RDBMS will then take a long time to filter each of them on aliquot_barcode.
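
A sketch of how you could test that without permanently losing the index (note that DROP INDEX takes an exclusive lock on snv_genotypes for the duration of the transaction, so do this while nothing else needs the table):

BEGIN;
DROP INDEX analysis.genotype_pos_ind;   -- only within this transaction
EXPLAIN
SELECT vca.aliquot_barcode, gt.called
  FROM analysis.var_case_aliquot vca
  LEFT JOIN analysis.snv_genotypes gt
    ON vca.aliquot_barcode = gt.aliquot_barcode
   AND vca.chrom = gt.chrom
   AND vca.start = gt.start
   AND vca."end" = gt."end"
   AND vca.alt::text = gt.alt::text;     -- the plan should now use genotype_pk
ROLLBACK;                                -- restores genotype_pos_ind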

And if it’s still too slow for you, you can try my older answer, which I’ll keep below for future reference:



Unfortunately, I won’t be able to optimize your query: there are too many things to take into account. Building a result with 9 million records of 13 fields might simply be too much: swapping might occur, your OS won’t allow that much memory allocation while also doing the JOIN, etc. (written before the real answer…)

I used to optimize queries spanning some fifteen tables of around 10 million records each. A SELECT of this size would never complete in a reasonable time (less than 10 hours).

I don’t have any RDBMS at hand to test what I’m saying. Also, I haven’t done any SQL for half a year :p Finding out why this is taking so much time (as you asked) would be too time consuming, so here is another solution to the original problem.


The solution I adopted was to build a temporary table:

  1. Create the temporary table: tmp_analysis, with the same fields as your SELECT + some utility fields:

An ID field (tmp_ID, a bigint), a boolean to check whether the record has been updated (tmp_updated), and a timestamp to check when it was updated (tmp_update_time). And of course all the fields, with the same datatypes, from your original SELECT (from both vca and gt).
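
A minimal sketch of such a table (the types for the vca metadata columns are assumptions, since the materialized view DDL above doesn’t show them; adjust to your schema):

CREATE TEMPORARY TABLE tmp_analysis (
    tmp_id                  bigint PRIMARY KEY,
    tmp_updated             boolean NOT NULL DEFAULT false,
    tmp_update_time         timestamp with time zone,
    -- fields from vca (text is assumed where the DDL does not show a type)
    aliquot_barcode         character(30),
    case_barcode            text,
    gene_symbol             text,
    variant_classification  text,
    variant_type            text,
    chrom                   character(2),
    start                   bigint,
    "end"                   bigint,
    alt                     character varying(510),
    -- fields to be filled in later from gt
    mutect2_call            boolean,
    ref_count               integer,
    alt_count               integer,
    read_depth              integer,
    vaf_corrected_call      boolean
);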

  2. Insert all your records from vca:

Use null (or any other default value if you can’t) for the fields coming from gt for the moment. Set tmp_updated to false. Use a simple counter for the primary key.
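
A sketch of that insert, using row_number() as the simple counter and leaving the gt columns NULL (it assumes the tmp_analysis layout sketched in step 1):

INSERT INTO tmp_analysis
    (tmp_id, tmp_updated, aliquot_barcode, case_barcode, gene_symbol,
     variant_classification, variant_type, chrom, start, "end", alt)
SELECT row_number() OVER (),   -- simple sequential primary key
       false,                  -- nothing updated yet
       vca.aliquot_barcode, vca.case_barcode, vca.gene_symbol,
       vca.variant_classification, vca.variant_type,
       vca.chrom, vca.start, vca."end", vca.alt
  FROM analysis.var_case_aliquot vca;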

  3. Update all these records with the fields from gt.

Use a WHERE rather than a JOIN:

UPDATE tmp_analysis AS tmp -- I don't think you need a schema prefix to call tmp_analysis
    SET tmp_updated = true,
        tmp_update_time = clock_timestamp(),
        mutect2_call = gt.called,
        ref_count = gt.ref_count,
        alt_count = gt.alt_count,
        read_depth = gt.read_depth,
        vaf_corrected_call = gt.called OR
            CASE
                WHEN (gt.alt_count + gt.ref_count) > 0
                    THEN (gt.alt_count::numeric / (gt.alt_count + gt.ref_count)::numeric) > 0.20
                ELSE false
            END -- your CASE/WHEN/ELSE/END from the original SELECT works here
FROM
    analysis.snv_genotypes gt
WHERE -- this plays the role of the JOIN condition
    tmp.aliquot_barcode = gt.aliquot_barcode AND
    tmp.chrom = gt.chrom AND
    tmp.start = gt.start AND
    tmp."end" = gt."end" AND
    tmp.alt::text = gt.alt::text;

I said that you should use EXISTS for performance reasons, but I was mistaken: I don’t think you can retrieve fields from inside the EXISTS condition. There might be a way to tell PostgreSQL that it’s a one-to-one relationship, but I’m not sure. In any case, keep the join columns indexed.

  4. Obviously, SELECT from your tmp_analysis table to get your records!

Some notes for this:

  1. If it’s taking too much time:

Use the tmp_ID field to limit the number of updated rows (to 10,000, for example) and check the execution plan of the 3rd query (the UPDATE): you should see a full scan on the temporary table and an index scan on gt (on genotype_pk). If not, check your indexes and search how to force index usage in PostgreSQL. Use WHERE tmp_ID < 10000 rather than LIMIT 10000; IIRC, LIMIT will execute the whole query and just give you part of the result.

  2. If it’s still taking too much time:

Segment the query using tmp_ID and (as you said) use a loop statement around the UPDATE so that you process 100,000 or fewer records at once (again, use WHERE tmp_ID < x AND tmp_ID > y). Check the execution plan again: the full scan should be limited by tmp_ID before the index scan. Don’t forget to add an index on this field (if it’s not already the primary key).
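
A sketch of that loop as a single DO block (note: in Postgres 10 you cannot COMMIT inside a DO block, so this runs as one transaction; to commit per batch, as in note 4, drive the same loop from a client-side script instead):

DO $$
DECLARE
    batch_size bigint := 100000;
    max_id     bigint;
    lo         bigint := 1;
BEGIN
    SELECT max(tmp_id) INTO max_id FROM tmp_analysis;
    WHILE lo <= max_id LOOP
        UPDATE tmp_analysis AS tmp
           SET tmp_updated = true,
               tmp_update_time = clock_timestamp()
               -- plus the same gt assignments as in step 3
          FROM analysis.snv_genotypes gt
         WHERE tmp.tmp_id >= lo AND tmp.tmp_id < lo + batch_size
           AND tmp.aliquot_barcode = gt.aliquot_barcode
           AND tmp.chrom = gt.chrom
           AND tmp.start = gt.start
           AND tmp."end" = gt."end"
           AND tmp.alt::text = gt.alt::text;
        RAISE NOTICE 'rows with tmp_id < % done at %', lo + batch_size, clock_timestamp();
        lo := lo + batch_size;
    END LOOP;
END $$;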

  3. If you need to call this again later:

Use BEGIN/END TRANSACTION to encapsulate all the queries, and use the TEMPORARY TABLE option on CREATE TABLE tmp_analysis so that you won’t have to clean up tmp_analysis after executing the query.
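
One way to get that cleanup automatically (a sketch; ON COMMIT DROP is my own suggestion and only makes sense if the whole pipeline runs inside a single transaction):

BEGIN;

CREATE TEMPORARY TABLE tmp_analysis (
    tmp_id bigint PRIMARY KEY
    -- ... remaining columns as in step 1
) ON COMMIT DROP;

-- steps 2 to 4 (INSERT, UPDATE, SELECT) go here

COMMIT;   -- tmp_analysis is dropped automatically at this point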

  4. If you still have a problem with loops:

Use transactions inside the loop, and stop it if it freezes again. Then you can resume later with a smaller batch size.

  5. If you want to reduce the execution time a little bit:

You can do steps 1 and 2 in one query with a CREATE TABLE ... AS SELECT, but I don’t remember how to set the datatypes for the fields coming from gt, as they’ll be set to null. Normally, this should be a little bit faster as a whole.
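
One way to handle the datatypes (a sketch): cast the NULL placeholders to the types you want, so the new table’s columns come out correctly typed:

CREATE TEMPORARY TABLE tmp_analysis AS
SELECT row_number() OVER ()            AS tmp_id,
       false                           AS tmp_updated,
       NULL::timestamp with time zone  AS tmp_update_time,
       vca.*,
       NULL::boolean                   AS mutect2_call,
       NULL::integer                   AS ref_count,
       NULL::integer                   AS alt_count,
       NULL::integer                   AS read_depth,
       NULL::boolean                   AS vaf_corrected_call
  FROM analysis.var_case_aliquot vca;

ALTER TABLE tmp_analysis ADD PRIMARY KEY (tmp_id);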

  6. If you’re curious:

If the query without the loop still takes more than 10 hours, stop it and check tmp_update_time to see how the execution time evolves; maybe it will give you a clue about why the original query didn’t work. There are multiple configuration options in PostgreSQL to limit RAM usage, disk usage and threads. Your OS might add its own limits; also check disk swapping, CPU cache usage, etc. (I think you’ve already done some of this, but I didn’t check.)
