I am struggling to optimize a simple LEFT JOIN against two very large tables; so far it has been running for more than 12 hours and is still going.
Here is the execution plan:
Gather  (cost=1001.26..11864143.06 rows=8972234 width=133)
  Workers Planned: 7
  ->  Nested Loop Left Join  (cost=1.26..10773657.51 rows=1281748 width=133)
        ->  Parallel Index Scan using var_case_aliquot_aliquot_ind on var_case_aliquot vca  (cost=0.56..464070.21 rows=1281748 width=103)
        ->  Index Scan using genotype_pos_ind on snv_genotypes gt  (cost=0.70..8.01 rows=1 width=65)
              Index Cond: ((vca.chrom = chrom) AND (vca.start = start) AND (vca.end = end) AND ((vca.alt)::text = (alt)::text))
              Filter: (vca.aliquot_barcode = aliquot_barcode)
Here is the query:
SELECT vca.aliquot_barcode,
       vca.case_barcode,
       vca.gene_symbol,
       vca.variant_classification,
       vca.variant_type,
       vca.chrom,
       int4range(vca.start::integer, vca."end"::integer, '[]'::text) AS pos,
       vca.alt,
       gt.called AS mutect2_call,
       gt.ref_count,
       gt.alt_count,
       gt.read_depth,
       gt.called OR
           CASE WHEN (gt.alt_count + gt.ref_count) > 0
                THEN (gt.alt_count::numeric / (gt.alt_count + gt.ref_count)::numeric) > 0.20
                ELSE false
           END AS vaf_corrected_call
FROM analysis.var_case_aliquot vca
LEFT JOIN analysis.snv_genotypes gt
       ON vca.aliquot_barcode = gt.aliquot_barcode
      AND vca.chrom = gt.chrom
      AND vca.start = gt.start
      AND vca."end" = gt."end"
      AND vca.alt::text = gt.alt::text
Both tables are very large: vca and gt have 9 million rows (2 GB) and 1.3 billion rows (346 GB), respectively.
I created the vca MATERIALIZED VIEW for the sole purpose of performing this join. Essentially it's a join table with only the required fields for a 1:1 matching left join and then some extra metadata. All the fields being joined on are properly indexed, as you can see from the query plan.
The query itself is simple enough; is there something I'm missing that could speed it up? I don't suppose there is some way to use WHERE instead?
Is there something I can tweak in my postgres settings that might help? Currently I have the following:
shared_buffers = 4096MB
effective_cache_size = 20GB
work_mem = 64MB
maintenance_work_mem = 4096MB
max_wal_size = 4GB
min_wal_size = 128MB
checkpoint_completion_target = 0.9
max_worker_processes = 16
max_parallel_workers_per_gather = 8
max_parallel_workers = 16
UPDATE 12/12:
Table DDL:
CREATE TABLE analysis.snv_genotypes (
    aliquot_barcode character(30) NOT NULL,
    chrom character(2) NOT NULL,
    start bigint NOT NULL,
    "end" bigint NOT NULL,
    alt character varying(510) NOT NULL,
    genotype character(3),
    read_depth integer,
    ref_count integer,
    alt_count integer,
    called boolean
);

ALTER TABLE ONLY analysis.snv_genotypes
    ADD CONSTRAINT genotype_pk PRIMARY KEY (aliquot_barcode, chrom, start, "end", alt);

CREATE INDEX called_ind ON analysis.snv_genotypes USING btree (called);
CREATE INDEX genotype_pos_ind ON analysis.snv_genotypes USING btree (chrom, start, "end", alt);

CREATE MATERIALIZED VIEW analysis.var_case_aliquot AS
SELECT var_case_aliquot.aliquot_barcode,
       var_case_aliquot.case_barcode,
       var_case_aliquot.chrom,
       var_case_aliquot.start,
       var_case_aliquot."end",
       var_case_aliquot.alt,
       var_case_aliquot.gene_symbol,
       var_case_aliquot.variant_classification,
       var_case_aliquot.variant_type,
       var_case_aliquot.hgvs_p,
       var_case_aliquot.polyphen,
       var_case_aliquot.sift
FROM var_case_aliquot
WITH NO DATA;

CREATE INDEX var_case_aliquot_aliquot_ind ON analysis.var_case_aliquot USING btree (aliquot_barcode);
CREATE INDEX var_case_aliquot_pos_ind ON analysis.var_case_aliquot USING btree (chrom, start, "end", alt);
More extensive DDL here: https://rextester.com/JRJH43442
UPDATE 12/13:
To clarify, I am using Postgres 10.5 on CentOS 7.3 with 16 cores and 32 GB of memory. The query has now been running for 24+ hours without producing any result.
Checking on its status, it seems that wait_event_type is IO. Does this mean the query is reading from/writing to scratch space? Could this explain the slowness?
+------------------+---------------+---------------+---------------+---------------+-----------------+--------------+--------+-------------+--------------+
| application_name | backend_start | xact_start    | query_start   | state_change  | wait_event_type | wait_event   | state  | backend_xid | backend_xmin |
+------------------+---------------+---------------+---------------+---------------+-----------------+--------------+--------+-------------+--------------+
| psql             | 12/12/18 8:42 | 12/12/18 8:42 | 12/12/18 8:42 | 12/12/18 8:42 | IO              | DataFileRead | active | 22135       | 22135        |
+------------------+---------------+---------------+---------------+---------------+-----------------+--------------+--------+-------------+--------------+
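(The status above comes from pg_stat_activity; roughly a query along these lines:)

-- The columns shown are all standard pg_stat_activity columns in PostgreSQL 10.
SELECT application_name, backend_start, xact_start, query_start, state_change,
       wait_event_type, wait_event, state, backend_xid, backend_xmin
FROM pg_stat_activity
WHERE state = 'active';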
I have a lot of available resources:
$ free -h
              total        used        free      shared  buff/cache   available
Mem:            31G        722M        210M        5.0G         30G         25G
Swap:          3.7G        626M        3.1G
I guess making more memory available could help? Is there some way to optimize queries that need more memory than available to them?
Answer
From this post’s comment:
Your query is using genotype_pos_ind and filtering on aliquot_barcode. Try deleting (temporarily) genotype_pos_ind, and if that doesn't work, search how to force index usage.
Your query should be using genotype_pk instead.
From what you said, there might be a lot of records with the same values for aliquot_barcode, chrom, start and end, so the RDBMS will then take a long time to filter every aliquot_barcode.
And if it's still too long for you, you can try my older answer, which I'll keep for further reference:
Unfortunately, I won't be able to optimize your query: there are too many things to take into account. Building a result with 9 million records of 13 fields might be too much: swapping might occur, your OS won't allow so much memory allocation while also performing the JOIN, etc. (written before the real answer above).
I used to optimize queries involving fifteen tables of around 10 million records each. A SELECT of this size would never be doable in reasonable time (less than 10 hours).
I don't have any RDBMS at hand to test what I'm saying, and I haven't done any SQL for half a year :p. Finding out why this is taking so much time (as you asked) would be too time consuming, so here is another solution to the original problem.
The solution I adopted was making a temporary table:
- Create the temporary table tmp_analysis, with the same fields as your SELECT plus some utility fields:
an ID field (tmp_ID, a bigint), a boolean to check whether the record has been updated (tmp_updated), and a timestamp to check when it was updated (tmp_update_time);
and of course all the fields, with the same datatypes, from your original SELECT (from vca and gt). A sketch follows below.
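A minimal sketch of that table; the types for the vca metadata columns (case_barcode, gene_symbol, variant_*) are not in the DDL shown, so they are assumptions to adjust:

-- Sketch of the scratch table: utility fields first, then the columns used by
-- the original SELECT.
CREATE TEMPORARY TABLE tmp_analysis (
    tmp_ID                  bigint,
    tmp_updated             boolean DEFAULT false,
    tmp_update_time         timestamp with time zone,
    aliquot_barcode         character(30),
    case_barcode            text,                    -- assumed type
    gene_symbol             text,                    -- assumed type
    variant_classification  text,                    -- assumed type
    variant_type            text,                    -- assumed type
    chrom                   character(2),
    start                   bigint,
    "end"                   bigint,
    alt                     character varying(510),
    mutect2_call            boolean,                 -- gt.called
    ref_count               integer,
    alt_count               integer,
    read_depth              integer,
    vaf_corrected_call      boolean
);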
- Insert all your records from vca:
use NULL (or any other default value if you can't) for the fields from gt for the moment, set tmp_updated to false, and use a simple counter (row_number() will do) for the primary key. See the sketch below.
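A sketch of that insert, assuming the tmp_analysis layout above and using row_number() as the simple counter:

-- Sketch: seed tmp_analysis from vca, leaving the gt-derived columns NULL for now.
INSERT INTO tmp_analysis (tmp_ID, tmp_updated, aliquot_barcode, case_barcode,
                          gene_symbol, variant_classification, variant_type,
                          chrom, start, "end", alt)
SELECT row_number() OVER (),   -- simple counter used as the surrogate key
       false,                  -- nothing updated yet
       vca.aliquot_barcode, vca.case_barcode,
       vca.gene_symbol, vca.variant_classification, vca.variant_type,
       vca.chrom, vca.start, vca."end", vca.alt
FROM analysis.var_case_aliquot vca;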
- Update all these records with fields from gt. Use a WHERE rather than a JOIN:
UPDATE tmp_analysis AS tmp  -- I don't think you need a schema qualifier for tmp_analysis
SET tmp_updated = true,
    tmp_update_time = clock_timestamp(),
    mutect2_call = gt.called,
    ref_count = gt.ref_count,
    alt_count = gt.alt_count,
    read_depth = gt.read_depth,
    vaf_corrected_call = gt.called OR
        CASE WHEN (gt.alt_count + gt.ref_count) > 0
             THEN (gt.alt_count::numeric / (gt.alt_count + gt.ref_count)::numeric) > 0.20
             ELSE false
        END
FROM analysis.snv_genotypes gt
WHERE -- a JOIN should work too
      tmp.aliquot_barcode = gt.aliquot_barcode
  AND tmp.chrom = gt.chrom
  AND tmp.start = gt.start
  AND tmp."end" = gt."end"
  AND tmp.alt::text = gt.alt::text;
I said that you should use EXISTS for performance reasons, but I was mistaken, as I don't think you can retrieve fields from inside the EXISTS condition. There might be a way to tell PostgreSQL that it's a one-to-one relationship, but I'm not sure. Anyway, index
- Obviously, SELECT from your tmp_analysis table to get your records! (See the sketch below.)
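A sketch of that final read, rebuilding the pos range the same way the original query does:

-- Sketch: pull the results out of the scratch table; int4range reproduces the
-- pos column from the original query.
SELECT aliquot_barcode, case_barcode, gene_symbol, variant_classification,
       variant_type, chrom,
       int4range(start::integer, "end"::integer, '[]'::text) AS pos,
       alt, mutect2_call, ref_count, alt_count, read_depth, vaf_corrected_call
FROM tmp_analysis;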
Some notes for this:
- If it's taking too much time:
Use the tmp_ID field to limit the number of updates to 10 000, for example, and check the execution plan of the 3rd query (the UPDATE): you should have a full scan on the temporary table and an index scan on gt (on genotype_pk). If not, check your indexes and search how to force index use by PGSQL. You should use WHERE tmp_ID < 10000 rather than LIMIT 10000; IIRC, LIMIT will execute the whole query and just give you part of the result. (See the sketch below.)
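A sketch of that check; 10000 is just the example cutoff, and only a couple of the SET columns are shown:

-- Sketch: check the plan on a small slice. EXPLAIN ANALYZE really executes the
-- UPDATE, so wrap it in a transaction and roll back if you only want the plan.
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE tmp_analysis AS tmp
SET tmp_updated = true,
    tmp_update_time = clock_timestamp()
FROM analysis.snv_genotypes gt
WHERE tmp.tmp_ID < 10000
  AND tmp.aliquot_barcode = gt.aliquot_barcode
  AND tmp.chrom = gt.chrom
  AND tmp.start = gt.start
  AND tmp."end" = gt."end"
  AND tmp.alt::text = gt.alt::text;
ROLLBACK;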
- If it's still taking too much time:
Segment the query using tmp_ID and (as you said) use a loop statement on the UPDATE to query 100 000 or fewer records at once (again, use WHERE tmp_ID < x AND tmp_ID > y). Check the execution plan again: the full scan should be limited by the tmp_ID condition before the index scan. Don't forget to add an index on this field (if it's not already the primary key). A loop sketch follows below.
- If you need to call this again later:
Use BEGIN/END TRANSACTION to encapsulate all the queries, and the TEMPORARY TABLE option on CREATE TABLE tmp_analysis so that you won't have to clean up tmp_analysis after executing the query. (A sketch follows below.)
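A sketch of that wrapping; ON COMMIT DROP is one option (my addition, not in the original text) that makes the temporary table disappear by itself when the transaction ends:

-- Sketch: run the whole pipeline in one transaction; the temporary table
-- vanishes at COMMIT.
BEGIN;
CREATE TEMPORARY TABLE tmp_analysis (
    tmp_ID          bigint,
    tmp_updated     boolean DEFAULT false,
    tmp_update_time timestamp with time zone
    -- ... plus the vca/gt columns from the sketch in step 1 ...
) ON COMMIT DROP;
-- the INSERT, UPDATE and final SELECT from the steps above go here
COMMIT;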
- If you still have a problem with loops:
Use transactions inside the loop, and stop it if it freezes again. Then you can resume it later with a smaller loop size.
- If you want to reduce the execution time a little:
You can do steps 1 and 2 in one query with an INSERT ... SELECT, but I don't remember how to set the datatype for the fields coming from gt, as they'll be set to NULL. Normally this should be a little bit faster as a whole. (See the sketch below.)
- If you're curious:
If the query without the loop still takes more than 10 hours, stop it and check tmp_update_time to see how the execution time evolves; maybe it'll give you a clue about why the original query didn't work. There are multiple configuration options in PGSQL to limit RAM usage, disk usage, and threads, and your OS might add its own limits. Also check disk swapping, CPU cache usage, etc. (I think you've already done some of this, but I didn't check.)