I’m trying to diagnose a slow query I have against a materialized view in Postgresql (10.12). In brief, I have two tables activity
and task
plus a materialized view, activity_status_view
that is essentially a subset of the rows of activity
plus a few extra columns added in that describe the “status” of the activity in the system.
Running a simple query joining the two tables is fast, but the same query joining the view with the task
table is unreasonably slow. The query that builds the materialized view is big and ugly, but my understanding was that it should behave exactly the same as a normal table for something like this. Another observation is that “refreshing” the materialized view in question is also very fast (less than a second).
I have dumped out the query explanation + table details (with some irrelevant columns removed). Can anyone explain the discrepancy? Could the query that builds the materialized view be relevant here?
> EXPLAIN ANALYSE SELECT * FROM task t JOIN activity_status_view a ON t.args->>'activityId' = a.id::text LIMIT 1 +-----------------------------------------------------------------------------------------------------------------------------------------------+ | QUERY PLAN | |-----------------------------------------------------------------------------------------------------------------------------------------------| | Limit (cost=0.00..7.43 rows=1 width=861) (actual time=1866719.779..1866719.779 rows=0 loops=1) | | -> Nested Loop (cost=0.00..187536570.93 rows=25235882 width=861) (actual time=1866719.777..1866719.777 rows=0 loops=1) | | Join Filter: ((t.args ->> 'activityId'::text) = (a.id)::text) | | Rows Removed by Join Filter: 5047176465 | | -> Seq Scan on task t (cost=0.00..9197.05 rows=85905 width=766) (actual time=0.013..95.792 rows=85905 loops=1) | | -> Materialize (cost=0.00..2706.30 rows=58753 width=95) (actual time=0.002..4.804 rows=58753 loops=85905) | | -> Seq Scan on activity_status_view a (cost=0.00..1551.53 rows=58753 width=95) (actual time=0.008..5.978 rows=58753 loops=1) | | Planning time: 0.231 ms | | Execution time: 1866720.757 ms | +-----------------------------------------------------------------------------------------------------------------------------------------------+ EXPLAIN Time: 1866.740s (31 minutes), executed in: 1866.722s (31 minutes) > EXPLAIN ANALYSE SELECT * FROM task t JOIN activity a ON t.args->>'activityId' = a.id::text LIMIT 1 +------------------------------------------------------------------------------------------------------------------------------------+ | QUERY PLAN | |------------------------------------------------------------------------------------------------------------------------------------| | Limit (cost=0.00..6.45 rows=1 width=819) (actual time=14.541..14.541 rows=1 loops=1) | | -> Nested Loop (cost=0.00..293143952.80 rows=45420551 width=819) (actual time=14.540..14.540 rows=1 loops=1) | | Join Filter: ((t.args ->> 'activityId'::text) = (a.id)::text) | | Rows Removed by Join Filter: 14753 | | -> Seq Scan on task t (cost=0.00..9197.05 rows=85905 width=766) (actual time=0.010..0.010 rows=1 loops=1) | | -> Materialize (cost=0.00..3710.19 rows=105746 width=53) (actual time=0.007..5.481 rows=14754 loops=1) | | -> Seq Scan on activity a (cost=0.00..2148.46 rows=105746 width=53) (actual time=0.005..1.488 rows=14754 loops=1) | | Planning time: 0.151 ms | | Execution time: 14.835 ms | +------------------------------------------------------------------------------------------------------------------------------------+ EXPLAIN Time: 0.036s > SELECT count(*) FROM activity_status_view +---------+ | count | |---------| | 58753 | +---------+ SELECT 1 Time: 0.021s > SELECT count(*) FROM activity +---------+ | count | |---------| | 105746 | +---------+ SELECT 1 Time: 0.036s
> d activity +--------------------------------------+-----------------------------+-------------+ | Column | Type | Modifiers | |--------------------------------------+-----------------------------+-------------| | id | uuid | not null | | frame_range | int4range | not null | +--------------------------------------+-----------------------------+-------------+ Indexes: "pk_activity" PRIMARY KEY, btree (id) Referenced by: TABLE "activity_status_change" CONSTRAINT "fk_activity_status_change_activity_id_activity" FOREIGN KEY (activity_id) REFERENCES activity(id) ON DELETE CASCADE
> d activity_status_view +--------------------------------------+-----------------------------+-------------+ | Column | Type | Modifiers | |--------------------------------------+-----------------------------+-------------| | id | uuid | | | frame_range | int4range | | | video_activity_type_id | uuid | | | status_change_id | uuid | | | version | integer | | | status | character varying | | +--------------------------------------+-----------------------------+-------------+ Indexes: "uq_activity_status_view_id" UNIQUE, btree (id)
> d task +--------------------------------+-----------------------------+-------------+ | Column | Type | Modifiers | |--------------------------------+-----------------------------+-------------| | id | uuid | not null | | args | jsonb | not null | +--------------------------------+-----------------------------+-------------+ Indexes: "pk_task" PRIMARY KEY, btree (id) TABLE "activity_status_change" CONSTRAINT "fk_activity_status_change_task_id_task" FOREIGN KEY (task_id) REFERENCES task(id) ON DELETE SET NULL
The following is equally slow:
SELECT * FROM task t JOIN activity a ON t.args->>'activityId' = a.id::text WHERE a.id IN (SELECT id FROM activity_status_view) LIMIT 1
Advertisement
Answer
activity_status_view that is essentially a subset of the rows of activity
That subset of the rows it chooses has excluded all the ones that actually match the join condition. So the LIMIT never kicks in and lets the query terminate early. Given that fact, the better plan would be to do a hash join, or use an index on activity_status_view.id
, or on task.args ->> 'activityId'
(you could try creating the last of those), but if the statistics are way off it might not realize that.
Because the distribution of values of ->>’activityId’ is not visible to the planner (being inside a JSONB), it can’t know how often those values might intersect. Creating the expression index can help it figure that out, (run ANALYZE on the table after creating the index), so it might use the statistics from the index to solve the planning problem, without actually using the index in the plan.
Is it just a coincidence that the MV query excludes all the matching rows, or is that by design?
Did you VACUUM ANALYZE the materialized view after refreshing it?
It is unlikely that this has anything to do with the fact that it is a materialized view. If you made it be a free-standing table (CREATE TABLE whatever AS SELECT…) or a regular nonmaterialized view, you would probably have the same problem.