Skip to content
Advertisement

How can I efficiently paginate the results of a complex SQL query?

I have a fairly complex SQL query which first fetches some data into a CTE and then performs several self-joins on the CTE in order to compute a value. Here’s an abberivated exmaple, with some complexities of our application simplified:

WITH subset AS (
  SELECT time, value, device_id FROM raw_data
  WHERE device_id IN (1, 2, 3)
  AND time BETWEEN '2019-01-01 00:00:00'::timestamp AND '2019-01-15 00:00:00'::timestamp
)

SELECT
  time,
  (("device_1".value + "device_2".value) / "device_3".value) as value
FROM 

(
  SELECT * FROM subset 
  WHERE device_id = 1
) "device_1"

INNER JOIN
(
  SELECT * FROM subset 
  WHERE device_id = 2
) "device_2"
ON "device_1".time = "device_2".time

INNER JOIN
(
  SELECT * FROM subset 
  WHERE device_id = 3
) "device_3"
ON "device_3".time = "device_2".time

The query is auto-generated and can scale to a complex computation over the values of potentially tens of devices. For performance reasons, we would like to paginate the results of this query, since the time range used could be large. A key constraint is that the data could have gaps in time, but we want to return a constant number of rows per page.

We have considered using LIMIT per_page OFFSET start at the end of the query which would be the standard approach, but this doesn’t buy us any speed up and the query performs the same. This makes sense, because in this case the LIMIT/OFFSET is performed after all the data has been fetched, joined, and computed and it just returns a slice of the data which is already computed. This doesn’t appreciably reduce how fast the query runs.

We have considered paginating the data fetched into the CTE, i.e. computing what time range corresponds to the page of interest, and then using that time range in the BETWEEN clause of the CTE. This would work, but the problem is we cannot reliably compute this time range, as some of the variables could have gaps in them. Therefore, if we compute 100 rows to be a window of 2 days and we fetch 2 days, there’s a possibility we get less than 100 rows if device_2 did not record data at some point in that window. For the computation, those data points would be dropped in the INNER JOINS.

The question is, is there an efficient way to paginate this query or restructure it to enable fast pagination, given these constraints? For example, is there some way to instruct the query planner to “join until you match 100 results matching the join conditions, and stop there”. We are running this on PostgreSQL, if that makes a difference.

Advertisement

Answer

1) Create a composite index with the following order device_id and time desc.

2) Try to generate a query in this way

select device_1.time,
 (("device_1".value + "device_2".value) / "device_3".value) as value 
from raw_data as device_1 ,raw_data as device_2 ,raw_data as device_3 
where device_1.devise_id = 1 
and device_2.devise_id = 2 
and device_3.devise_id = 3 
and device_1.time BETWEEN '2019-01-01 00:00:00'::timestamp AND '2019-01-15 00:00:00'::timestamp 
and device_2.time BETWEEN '2019-01-01 00:00:00'::timestamp AND '2019-01-15 00:00:00'::timestamp 
and device_3.time BETWEEN '2019-01-01 00:00:00'::timestamp AND '2019-01-15 00:00:00'::timestamp 
and device_1.time = device_2.time 
and device_2.time = device_3.time
Advertisement