We have a query that returns all jobs with changes inside a given time period. Depending on the selected period, performance goes from under 100 milliseconds for one day to about 7 seconds for a week.
I found that as long as the time period is small enough, the index is used and the query is fast. Once the period gets too big, the index is no longer used and the query becomes slow.
The server runs PostgreSQL 9.2.
Why does this happen, and how can I fix it?
Create script:
CREATE TABLE IF NOT EXISTS "Job" (
  "id" serial PRIMARY KEY,
  "serial" TEXT NOT NULL
);
CREATE UNIQUE INDEX "index_Job_serial" ON "Job" ("serial" ASC);

CREATE TABLE IF NOT EXISTS "Property" (
  "id" serial PRIMARY KEY,
  "name" TEXT NOT NULL
);

CREATE TABLE IF NOT EXISTS "Timestamp" (
  "id" serial PRIMARY KEY,
  "usSince1970" BIGINT NOT NULL,
  "localTime" TEXT
);
CREATE INDEX "index_Timestamp_usSince1970" ON "Timestamp" USING btree ("usSince1970");

CREATE TABLE IF NOT EXISTS "Changes" (
  "idJob" INTEGER NOT NULL,
  "idProperty" INTEGER NOT NULL,
  "idTimestamp" INTEGER NOT NULL,
  "value1" decimal(25,5),
  "value2" INTEGER,
  "value3" TEXT,
  PRIMARY KEY ("idJob", "idProperty", "idTimestamp"),
  FOREIGN KEY ("idJob") REFERENCES "Job" ("id"),
  FOREIGN KEY ("idProperty") REFERENCES "Property" ("id"),
  FOREIGN KEY ("idTimestamp") REFERENCES "Timestamp" ("id")
);
CREATE INDEX "index_Changes_idJob" ON "Changes" ("idJob" ASC);
CREATE INDEX "index_Changes_idProperty" ON "Changes" ("idProperty" ASC);
CREATE INDEX "index_Changes_idTimestamp" ON "Changes" ("idTimestamp" DESC);
Fast query:
-- fast query (1 day)
SELECT DISTINCT "idJob"
FROM "Changes"
INNER JOIN "Timestamp" ON "Timestamp"."id" = "Changes"."idTimestamp"
WHERE "Timestamp"."usSince1970" BETWEEN 1584745200000000 AND 1584831600000000;

-- explain
HashAggregate  (cost=26383.48..26444.33 rows=6085 width=4) (actual time=8.039..8.078 rows=179 loops=1)
  ->  Nested Loop  (cost=0.00..26368.26 rows=6085 width=4) (actual time=0.031..7.059 rows=6498 loops=1)
        ->  Index Scan using "index_Timestamp_usSince1970" on "Timestamp"  (cost=0.00..96.25 rows=2510 width=4) (actual time=0.022..0.514 rows=2671 loops=1)
              Index Cond: (("usSince1970" >= 1584745200000000::bigint) AND ("usSince1970" <= 1584831600000000::bigint))
        ->  Index Scan using "index_Changes_idTimestamp" on "Changes"  (cost=0.00..10.27 rows=20 width=8) (actual time=0.002..0.002 rows=2 loops=2671)
              Index Cond: ("idTimestamp" = "Timestamp".id)
Total runtime: 8.204 ms
Slow query:
-- slow query (7 days)
SELECT DISTINCT "idJob"
FROM "Changes"
INNER JOIN "Timestamp" ON "Timestamp"."id" = "Changes"."idTimestamp"
WHERE "Timestamp"."usSince1970" BETWEEN 1583708400000000 AND 1584313200000000;

-- explain
Unique  (cost=570694.82..571824.16 rows=92521 width=4) (actual time=8869.569..8930.545 rows=3695 loops=1)
  ->  Sort  (cost=570694.82..571259.49 rows=225867 width=4) (actual time=8869.568..8915.372 rows=260705 loops=1)
        Sort Key: "Changes"."idJob"
        Sort Method: external merge  Disk: 3552kB
        ->  Hash Join  (cost=4926.44..547518.97 rows=225867 width=4) (actual time=6325.494..8734.353 rows=260705 loops=1)
              Hash Cond: ("Changes"."idTimestamp" = "Timestamp".id)
              ->  Seq Scan on "Changes"  (cost=0.00..250722.43 rows=16238343 width=8) (actual time=0.004..2505.794 rows=16238343 loops=1)
              ->  Hash  (cost=3397.79..3397.79 rows=93172 width=4) (actual time=42.392..42.392 rows=107093 loops=1)
                    Buckets: 4096  Batches: 4  Memory Usage: 948kB
                    ->  Index Scan using "index_Timestamp_usSince1970" on "Timestamp"  (cost=0.00..3397.79 rows=93172 width=4) (actual time=0.006..20.831 rows=107093 loops=1)
                          Index Cond: (("usSince1970" >= 1583708400000000::bigint) AND ("usSince1970" <= 1584313200000000::bigint))
Total runtime: 8932.374 ms
Thanks in advance.
Answer
The slow query processes far more data (about 100,000 vs. 2,500 rows from "Timestamp"), so it is not surprising that it is slower.
You can force PostgreSQL to use a nested loop join with the slow query as well:
BEGIN;
SET LOCAL enable_hashjoin = off;
SET LOCAL enable_mergejoin = off;
SELECT ...;
COMMIT;
Try that and see whether PostgreSQL was right and the hash join really is the faster plan.
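To make the test concrete, here is the complete transaction with the seven-day query from the question; EXPLAIN ANALYZE executes the statement and reports the actual runtime, which you can compare with the roughly 8.9 seconds of the hash join plan:

BEGIN;
SET LOCAL enable_hashjoin = off;
SET LOCAL enable_mergejoin = off;

-- With hash and merge joins disabled, the planner falls back to a nested loop.
EXPLAIN ANALYZE
SELECT DISTINCT "idJob"
FROM "Changes"
INNER JOIN "Timestamp" ON "Timestamp"."id" = "Changes"."idTimestamp"
WHERE "Timestamp"."usSince1970" BETWEEN 1583708400000000 AND 1584313200000000;

COMMIT;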
I suspect that PostgreSQL is doing the right thing here: the plan shows the sort spilling to disk ("Sort Method: external merge Disk: 3552kB") and the hash being split into four batches, so the best way for you to improve performance is probably to increase work_mem.
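As a sketch (the 64MB figure is an assumption, not a recommendation; pick a value based on your RAM and the number of concurrent sessions):

-- Experiment in a session first:
SET work_mem = '64MB';  -- assumed value; large enough that the sort and hash stay in memory

-- Once a good value is found, persist it, e.g. for the application role
-- ("app_user" is a hypothetical role name):
ALTER ROLE app_user SET work_mem = '64MB';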
If you are willing to add another index and to VACUUM "Changes" often enough, you can get even better performance with an index-only scan:
CREATE INDEX ON "Changes" ("idTimestamp") INCLUDE ("idJob");
On PostgreSQL versions before v11, which lack INCLUDE (and that includes your 9.2), the equivalent would be
CREATE INDEX ON "Changes" ("idTimestamp", "idJob");
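Either way, VACUUM the table afterwards and verify the plan; if the visibility map is current enough, an "Index Only Scan" node should replace the sequential scan (a sketch, using the seven-day query from the question):

VACUUM "Changes";

EXPLAIN
SELECT DISTINCT "idJob"
FROM "Changes"
INNER JOIN "Timestamp" ON "Timestamp"."id" = "Changes"."idTimestamp"
WHERE "Timestamp"."usSince1970" BETWEEN 1583708400000000 AND 1584313200000000;
-- look for: Index Only Scan using ... on "Changes"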
Then you should drop the now-redundant index "index_Changes_idTimestamp":
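DROP INDEX "index_Changes_idTimestamp";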
By the way, you are making your life unnecessarily hard by using camel case and quoted identifiers: quoted names are case-sensitive, so you have to repeat the quotes, with exactly the same spelling, everywhere they appear.
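With unquoted, lower-case identifiers, names fold to lower case automatically and never need quotes. A sketch of what the first table could look like (the renamed column serial_number is illustrative, not from the original schema):

CREATE TABLE IF NOT EXISTS job (
  id serial PRIMARY KEY,          -- unquoted names are folded to lower case
  serial_number text NOT NULL     -- no quoting needed anywhere in queries
);
CREATE UNIQUE INDEX index_job_serial_number ON job (serial_number);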