Why does adding ORDER BY drastically speed up query?

I’ve discovered some very strange and counter-intuitive behaviour in PostgreSQL.

I have a query structured as follows: I select both the IDs and the count from a subquery. The subquery does the filtering, joining, and counting, but orders only by the IDs, since I'm using DISTINCT ON() to get each unique ID just once.

The outer query then applies the desired ordering, plus any LIMIT and OFFSET as needed. Here is an example of what the query structure looks like:

SELECT s.id, s.item_count
FROM (
    SELECT DISTINCT ON (work_items.id) work_items.id
        , work_item_states.disposition AS disposition
        , COUNT(work_items.id) OVER () AS item_count
    FROM work_items
    JOIN work_item_states ON work_item_states.work_item_refer = work_items.id
    WHERE work_item_states.disposition = 'cancelled'
    ORDER BY work_items.id
) AS s
ORDER BY s.disposition
LIMIT 50
OFFSET 0
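
For context, here is a minimal sketch of the schema this query assumes. The table, column, and index names are taken from the query and the plans below; the column types are my assumptions, not from the original post:

-- Hypothetical minimal schema matching the identifiers in the query above.
-- Column types are assumptions for illustration only.
CREATE TABLE work_items (
    id text    -- the plans below reference an index named idx_work_items_id
);
CREATE INDEX idx_work_items_id ON work_items (id);

CREATE TABLE work_item_states (
    work_item_refer text,    -- joins to work_items.id
    disposition     text
);
CREATE INDEX idx_work_item_states_work_item_refer
    ON work_item_states (work_item_refer);
CREATE INDEX idx_work_item_states_disposition
    ON work_item_states (disposition);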

Here is the strange part, however: my database has several million entries, so queries aren't fast overall, but when I remove the ORDER BY clause from the OUTER query, the query slows down drastically.

However, if I also remove the LIMIT clause, it becomes fast again, even though this example query returns 800 000+ results.

In summary, for the outer query:

ORDER BY AND LIMIT – Fast

...
) AS s
ORDER BY s.disposition
LIMIT 50
OFFSET 0

ONLY LIMIT – Very slow

...
) AS s
LIMIT 50
OFFSET 0

ONLY ORDER BY – Fast, despite 800 000 results

...
) AS s
ORDER BY s.disposition
OFFSET 0

NEITHER – Fast, despite 800 000 results

...
) AS s
OFFSET 0

To give an idea of how much slower the LIMIT-only version is: with both clauses, with neither, or with just ORDER BY, the queries take no more than about 10 seconds.

With only the LIMIT clause, however, the query takes about a minute and 15 seconds, over 7 times as long!

You'd think that ORDER BY would slow things down instead, since it has to sort the results of the subquery, but apparently that isn't the case. It's very counter-intuitive.

If someone knows what’s going on behind the scenes here, I’d greatly appreciate them shedding some light on this.

Thanks

EDIT – Added execution plans for statements:

ORDER BY and LIMIT execution plan

Limit  (cost=518486.52..518486.65 rows=50 width=53)
  ->  Sort  (cost=518486.52..520495.59 rows=803628 width=53)
        Sort Key: s.disposition
        ->  Subquery Scan on s  (cost=479736.16..491790.58 rows=803628 width=53)
              ->  Unique  (cost=479736.16..483754.30 rows=803628 width=53)
                    ->  Sort  (cost=479736.16..481745.23 rows=803628 width=53)
                          Sort Key: work_items.id
                          ->  WindowAgg  (cost=136262.98..345979.65 rows=803628 width=53)
                                ->  Hash Join  (cost=136262.98..335934.30 rows=803628 width=45)
                                      Hash Cond: (work_items.id = work_item_states.work_item_refer)
                                      ->  Seq Scan on work_items  (cost=0.00..106679.48 rows=4020148 width=37)
                                      ->  Hash  (cost=119152.97..119152.97 rows=803681 width=45)
                                            ->  Bitmap Heap Scan on work_item_states  (cost=18968.96..119152.97 rows=803681 width=45)
                                                  Recheck Cond: (disposition = 'cancelled'::text)
                                                  ->  Bitmap Index Scan on idx_work_item_states_disposition  (cost=0.00..18768.04 rows=803681 width=0)
                                                        Index Cond: (disposition = 'cancelled'::text)

Only LIMIT execution plan

Limit  (cost=1.11..69.52 rows=50 width=45)
  ->  Subquery Scan on s  (cost=1.11..1099599.17 rows=803628 width=45)
        ->  Unique  (cost=1.11..1091562.89 rows=803628 width=77)
              ->  WindowAgg  (cost=1.11..1089553.82 rows=803628 width=77)
                    ->  Merge Join  (cost=1.11..1079508.47 rows=803628 width=37)
                          Merge Cond: (work_items.id = work_item_states.work_item_refer)
                          ->  Index Only Scan using idx_work_items_id on work_items  (cost=0.56..477365.14 rows=4020148 width=37)
                          ->  Index Scan using idx_work_item_states_work_item_refer on work_item_states  (cost=0.56..582047.48 rows=803681 width=37)
                                Filter: (disposition = 'cancelled'::text)

Only ORDER BY execution plan

Sort  (cost=625547.09..627556.16 rows=803628 width=53)
  Sort Key: s.disposition
  ->  Subquery Scan on s  (cost=479736.16..491790.58 rows=803628 width=53)
        ->  Unique  (cost=479736.16..483754.30 rows=803628 width=53)
              ->  Sort  (cost=479736.16..481745.23 rows=803628 width=53)
                    Sort Key: work_items.id
                    ->  WindowAgg  (cost=136262.98..345979.65 rows=803628 width=53)
                          ->  Hash Join  (cost=136262.98..335934.30 rows=803628 width=45)
                                Hash Cond: (work_items.id = work_item_states.work_item_refer)
                                ->  Seq Scan on work_items  (cost=0.00..106679.48 rows=4020148 width=37)
                                ->  Hash  (cost=119152.97..119152.97 rows=803681 width=45)
                                      ->  Bitmap Heap Scan on work_item_states  (cost=18968.96..119152.97 rows=803681 width=45)
                                            Recheck Cond: (disposition = 'cancelled'::text)
                                            ->  Bitmap Index Scan on idx_work_item_states_disposition  (cost=0.00..18768.04 rows=803681 width=0)
                                                  Index Cond: (disposition = 'cancelled'::text)
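
Note that these are cost-only plans from plain EXPLAIN. To capture actual row counts and timings (assuming it is acceptable to execute the query), each variant could be re-run under EXPLAIN (ANALYZE, BUFFERS), for example:

-- Plain EXPLAIN shows estimates only; ANALYZE executes the query and
-- reports actual rows and timing, BUFFERS adds I/O statistics.
EXPLAIN (ANALYZE, BUFFERS)
SELECT s.id, s.item_count
FROM (
    SELECT DISTINCT ON (work_items.id) work_items.id
        , work_item_states.disposition AS disposition
        , COUNT(work_items.id) OVER () AS item_count
    FROM work_items
    JOIN work_item_states ON work_item_states.work_item_refer = work_items.id
    WHERE work_item_states.disposition = 'cancelled'
    ORDER BY work_items.id
) AS s
LIMIT 50
OFFSET 0;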

Answer

You didn’t post your execution plans, but I have my crystal ball ready, so I’ll have a guess at what’s going on.

In your second, very slow query the optimizer has a bright idea how to make it fast: it scans work_items in id order using the index, fetches the matching rows from work_item_states as it goes (your posted plan shows a merge join over two index scans), and filters out everything that does not match work_item_states.disposition = 'cancelled' until it has found 50 distinct results.

This looks like a good idea because a LIMIT only needs the first few rows, so the planner picks the plan with the lowest startup cost (1.11 versus 518486.52 in your plans). But the optimizer does not know that all the rows with work_item_states.disposition = 'cancelled' match work_items with a high id, so it has to scan forever until it has found its 50 rows.

None of the other queries lets the planner choose that strategy, because it is only promising when a few rows in work_items.id order will do.
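
If you want to keep the LIMIT without the outer ORDER BY, one common workaround (a sketch on my part, not something the plans above prescribe) is to force the subquery to be fully materialized so the planner cannot stream it into the LIMIT. In PostgreSQL 12 and later this can be requested explicitly with a MATERIALIZED CTE; in versions before 12, every CTE is materialized anyway:

-- Sketch of a possible workaround: materializing the subquery prevents the
-- planner from pushing the cheap-startup merge-join plan down into it.
-- MATERIALIZED requires PostgreSQL 12+; older versions always materialize CTEs.
WITH s AS MATERIALIZED (
    SELECT DISTINCT ON (work_items.id) work_items.id
        , work_item_states.disposition AS disposition
        , COUNT(work_items.id) OVER () AS item_count
    FROM work_items
    JOIN work_item_states ON work_item_states.work_item_refer = work_items.id
    WHERE work_item_states.disposition = 'cancelled'
    ORDER BY work_items.id
)
SELECT s.id, s.item_count
FROM s
LIMIT 50
OFFSET 0;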
