After five days of trying to solve a performance problem with our PostgreSQL database, I have decided to ask for help! A week ago we decided to move our database with 60M records from MSSQL to PostgreSQL, and the SQL below is extremely slow on PostgreSQL.
set random_page_cost = 1;
set seq_page_cost = 5;
set enable_seqscan = on;
set work_mem = '100MB';

SELECT DATE("DateStamp"), "Result", Count(*), Sum("ConversionCost")
FROM "Log"
WHERE "UserId" = 7841
  AND "DateStamp" > '2019-01-01'
  AND "DateStamp" < '2020-02-26'
GROUP BY 1, 2
The execution plan:
Finalize GroupAggregate  (cost=1332160.59..1726394.02 rows=3093547 width=21) (actual time=2929.936..3157.049 rows=714 loops=1)
  Output: (date("DateStamp")), "Result", count(*), sum("ConversionCost")
  Group Key: (date("Log"."DateStamp")), "Log"."Result"
  Buffers: shared hit=2292 read=345810
  ->  Gather Merge  (cost=1332160.59..1661945.12 rows=2577956 width=21) (actual time=2929.783..3156.616 rows=2037 loops=1)
        Output: (date("DateStamp")), "Result", (PARTIAL count(*)), (PARTIAL sum("ConversionCost"))
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=6172 read=857125
        ->  Partial GroupAggregate  (cost=1331160.56..1363385.01 rows=1288978 width=21) (actual time=2906.450..3089.056 rows=679 loops=3)
              Output: (date("DateStamp")), "Result", PARTIAL count(*), PARTIAL sum("ConversionCost")
              Group Key: (date("Log"."DateStamp")), "Log"."Result"
              Buffers: shared hit=6172 read=857125
              Worker 0: actual time=2895.531..3058.852 rows=675 loops=1
                Buffers: shared hit=1930 read=255687
              Worker 1: actual time=2894.513..3052.916 rows=673 loops=1
                Buffers: shared hit=1950 read=255628
              ->  Sort  (cost=1331160.56..1334383.01 rows=1288978 width=9) (actual time=2906.435..2968.562 rows=1064916 loops=3)
                    Output: (date("DateStamp")), "Result", "ConversionCost"
                    Sort Key: (date("Log"."DateStamp")), "Log"."Result"
                    Sort Method: quicksort  Memory: 94807kB
                    Worker 0:  Sort Method: quicksort  Memory: 69171kB
                    Worker 1:  Sort Method: quicksort  Memory: 69063kB
                    Buffers: shared hit=6172 read=857125
                    Worker 0: actual time=2895.518..2951.406 rows=951356 loops=1
                      Buffers: shared hit=1930 read=255687
                    Worker 1: actual time=2894.494..2947.892 rows=949038 loops=1
                      Buffers: shared hit=1950 read=255628
                    ->  Parallel Index Scan using "IX_Log_UserId" on public."Log"  (cost=0.56..1200343.50 rows=1288978 width=9) (actual time=0.087..2634.603 rows=1064916 loops=3)
                          Output: date("DateStamp"), "Result", "ConversionCost"
                          Index Cond: ("Log"."UserId" = 7841)
                          Filter: (("Log"."DateStamp" > '2019-01-01 00:00:00'::timestamp without time zone) AND ("Log"."DateStamp" < '2020-02-26 00:00:00'::timestamp without time zone))
                          Buffers: shared hit=6144 read=857123
                          Worker 0: actual time=0.077..2653.065 rows=951356 loops=1
                            Buffers: shared hit=1917 read=255685
                          Worker 1: actual time=0.107..2654.640 rows=949038 loops=1
                            Buffers: shared hit=1935 read=255628
Planning Time: 0.330 ms
Execution Time: 3163.850 ms
Execution plan URL https://explain.depesz.com/s/zLNI
The same SQL takes under 2 seconds on MSSQL, but on PostgreSQL it takes as long as 10 seconds. The Log table contains about 60M records, and the WHERE clause ("UserId" = 7841 AND "DateStamp" > '2019-01-01' AND "DateStamp" < '2020-02-26') filters about 3M of them.
The table structure is below:
create table "Log" ( "Id" integer generated by default as identity constraint "PK_Log" primary key, "Result" boolean not null, "DateStamp" timestamp not null, "ConversionCost" integer not null, "UserId" integer not null constraint "FK_Log_User_UserId" references "User" on delete cascade, ); create index "IX_Log_ConversionCost" on "Log" ("ConversionCost"); create index "IX_Log_DateStamp" on "Log" ("DateStamp"); create index "IX_Log_Result" on "Log" ("Result"); create index "IX_Log_UserId" on "Log" ("UserId");
The PostgreSQL server has 6 CPUs and 16GB of RAM, compared to our old MSSQL server with 2 CPUs and 8GB of RAM, so PostgreSQL has more computing resources yet performs much worse. Both servers have SSDs.
Could the problem simply be that PostgreSQL is not as advanced performance-wise as MS SQL, and nothing can be done here?
Answer
You can rephrase the query as:
SELECT DATE("DateStamp"), "Result", Count(*), Sum("ConversionCost") FROM "Log" WHERE "UserId" = 7841 AND "DateStamp" >= '2019-01-02' AND "DateStamp" < '2020-02-26' GROUP BY 1,2
Then, the query would greatly benefit from the index:
create index "IX_Log_UserId" on "Log" ("UserId", "DateStamp"));
For even better performance, you can create a covering index:
create index "IX_Log_UserId" on "Log" ( "UserId", "DateStamp", "Result", "ConversionCost" );