
Why does adding ORDER BY drastically speed up query?

I’ve discovered some very strange and counter-intuitive behaviour in PostgreSQL.

I have a query structure as follows. I am selecting both the IDs and the count from a subquery. The subquery does the filtering, joining, and counting, but only orders by the IDs, since I’m using DISTINCT ON() to get only unique IDs.

The outer query then does the proper ordering, and any limits and offsets as needed. Here is an example of what the query structure looks like:
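The actual query was attached as an image in the original post, so here is a minimal sketch of the structure described above. The table and column names (work_items, work_item_states, disposition) are borrowed from the answer below; the join column work_item_id and everything else is illustrative:

    -- Hedged reconstruction: the real query was posted as an image.
    -- Table/column names come from the answer below; the join column
    -- work_item_id is an assumption.
    SELECT id, item_count
    FROM (
        SELECT DISTINCT ON (wi.id)
               wi.id,
               count(*) OVER (PARTITION BY wi.id) AS item_count
        FROM work_items wi
        JOIN work_item_states wis ON wis.work_item_id = wi.id
        WHERE wis.disposition = 'cancelled'
        ORDER BY wi.id                -- DISTINCT ON requires ordering by id first
    ) AS sub
    ORDER BY item_count DESC          -- the "proper" ordering happens out here
    LIMIT 50 OFFSET 0;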

However, I’ve discovered something strange. My database has several million entries, so queries aren’t particularly fast overall. But when I remove the ORDER BY clause from the OUTER query, it becomes drastically slower.

However, if I also remove the LIMIT clause, it becomes fast again, even though this example query returns 800 000+ results.

In summary, for the outer query:

- ORDER BY and LIMIT – fast
- Only LIMIT – very slow
- Only ORDER BY – fast, despite 800 000+ results
- Neither – fast, despite 800 000+ results

To give an idea of how much slower having only the LIMIT clause is: with both clauses, with neither, or with just ORDER BY, the queries take no more than about 10 seconds.

With only the LIMIT clause, however, they take about 1 minute 15 seconds, over 7 times as long!

You’d think that ORDER BY would instead slow things down, as it has to sort the results of the subquery, but it seems that isn’t the case. It’s very counter-intuitive.

If someone knows what’s going on behind the scenes here, I’d greatly appreciate them shedding some light on this.

Thanks

EDIT – Added execution plans for the statements:

- ORDER BY and LIMIT execution plan
- Only LIMIT execution plan
- Only ORDER BY execution plan
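(The plans above were attached as links in the original post. For anyone reproducing this, plans like these come from prefixing the statement with EXPLAIN; ANALYZE actually executes the query and reports real timings. A sketch:)

    -- Substitute whichever of the four variants is under test;
    -- BUFFERS adds I/O detail to the plan output.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM work_items LIMIT 50;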


Answer

You didn’t post your execution plans, but I have my crystal ball ready, so I’ll have a guess at what’s going on.

In your second, very slow query the optimizer has a bright idea how to make it fast: it scans work_items using the index on id, fetches all the matching rows from work_item_states in a nested loop, and filters out everything that does not match work_item_states.disposition = 'cancelled' until it has found 50 distinct results.

This is a good idea, but the optimizer does not know that all the rows with work_item_states.disposition = 'cancelled' match work_items with a high id, so it has to scan forever until it has found its 50 rows.

None of the other queries allow the planner to choose that strategy, because it is only promising when the first few rows in work_items.id order are enough.
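A hedged follow-up, not from the original answer: if you need the LIMIT without the outer ORDER BY, one common workaround is to put an optimization fence between the subquery and the LIMIT, so the planner cannot push the limit down into the index scan on work_items.id. On PostgreSQL 12 and later, a materialized CTE does this (table names and the work_item_id join column are the same assumptions as in the sketch above):

    -- Sketch of a workaround (not from the original answer): AS MATERIALIZED
    -- forces the CTE to be computed in full, so the planner cannot use the
    -- "walk the id index until 50 rows are found" strategy.
    WITH sub AS MATERIALIZED (
        SELECT DISTINCT ON (wi.id)
               wi.id,
               count(*) OVER (PARTITION BY wi.id) AS item_count
        FROM work_items wi
        JOIN work_item_states wis ON wis.work_item_id = wi.id
        WHERE wis.disposition = 'cancelled'
        ORDER BY wi.id
    )
    SELECT * FROM sub
    LIMIT 50;

(On PostgreSQL 11 and earlier, every CTE is materialized by default, so plain WITH has the same effect.)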
