I’m trying to determine if there a “low cost” optimization for the following query. We’ve implemented a system whereby ‘tickets’ earn ‘points’ and thus can be ranked. In order to support analytical type of queries, we store the rank of every ticket (tickets can be tied) with the ticket.
I’ve found that, at scale, updating this rank is very slow. I’m attempting to run the scenario below on a set of “tickets” that is about 20k tickets big.
I’m hoping that someone can help identify why and offer some help.
We’re on postgres 9.3.6
Here’s a simplified ticket table schema:
ogs_1=> d api_ticket Table "public.api_ticket" Column | Type | Modifiers ------------------------------+--------------------------+--------------------------------------------------------- id | integer | not null default nextval('api_ticket_id_seq'::regclass) status | character varying(3) | not null points_earned | integer | not null rank | integer | not null event_id | integer | not null user_id | integer | not null Indexes: "api_ticket_pkey" PRIMARY KEY, btree (id) "api_ticket_4437cfac" btree (event_id) "api_ticket_e8701ad4" btree (user_id) "api_ticket_points_earned_idx" btree (points_earned) "api_ticket_rank_idx" btree ("rank") Foreign-key constraints: "api_ticket_event_id_598c97289edc0e3e_fk_api_event_id" FOREIGN KEY (event_id) REFERENCES api_event(id) DEFERRABLE INITIALLY DEFERRED (user_id) REFERENCES auth_user(id) DEFERRABLE INITIALLY DEFERRED
Here’s the query that I’m executing:
UPDATE api_ticket t SET rank = ( SELECT rank FROM (SELECT Rank() over ( Partition BY event_id ORDER BY points_earned DESC ) as rank, id FROM api_ticket tt WHERE event_id = t.event_id AND tt.status != 'x' ) as r WHERE r.id = t.id ) WHERE event_id = <EVENT_ID> AND t.status != 'x';
Here’s the explain on a set of about 10k rows:
Update on api_ticket t (cost=0.00..1852176.70 rows=9646 width=88) (actual time=1254035.623..1254035.623 rows=0 loops=1) -> Seq Scan on api_ticket t (cost=0.00..1852176.70 rows=9646 width=88) (actual time=121.611..1253148.416 rows=9748 loops=1) Filter: (((status)::text <> 'x'::text) AND (event_id = 207)) Rows Removed by Filter: 10 SubPlan 1 -> Subquery Scan on r (cost=159.78..191.97 rows=1 width=8) (actual time=87.466..128.537 rows=1 loops=9748) Filter: (r.id = t.id) Rows Removed by Filter: 9747 -> WindowAgg (cost=159.78..178.55 rows=1073 width=12) (actual time=46.389..108.954 rows=9748 loops=9748) -> Sort (cost=159.78..162.46 rows=1073 width=12) (actual time=46.370..66.163 rows=9748 loops=9748) Sort Key: tt.points_earned Sort Method: quicksort Memory: 799kB -> Index Scan using api_ticket_4437cfac on api_ticket tt (cost=0.29..105.77 rows=1073 width=12) (actual time=2.698..26.448 rows=9748 loops=9748) Index Cond: (event_id = t.event_id) Filter: ((status)::text <> 'x'::text) Total runtime: 1254036.583 ms
Advertisement
Answer
The correlated subquery has to be executed for every row (20k times in your example). This only makes sense for a small number of rows or where the computation requires it.
This derived table is computed once before we join to it:
UPDATE api_ticket t SET rank = tt.rnk FROM ( SELECT tt.id , rank() OVER (PARTITION BY tt.event_id ORDER BY tt.points_earned DESC) AS rnk FROM api_ticket tt WHERE tt.status <> 'x' AND tt.event_id = <EVENT_ID> ) tt WHERE t.id = tt.id AND t.rank <> tt.rnk; -- avoid empty updates
Should be quite a bit faster. 🙂
Additional improvements
The last predicate rules out empty updates:
Only makes sense if the new rank can be the old rank at least occasionally. Else remove it.
We don’t need to repeat AND t.status != 'x'
in the outer query since we join on the PK column id
it’s the same value on both sides.
And the standard SQL inequality operator is <>
, even if Postgres supports !=
, too.
Push down the predicate event_id = <EVENT_ID>
into the subquery as well. No need to compute numbers for any other event_id
. This was handed down from the outer query in your original. In the rewritten query, we best apply it in the subquery altogether. Since we use PARTITION BY tt.event_id
, that’s not going to mess with ranks.