I have a fairly complex SQL query which first fetches some data into a CTE and then performs several self-joins on the CTE to compute a value. Here’s an abbreviated example, with some of the complexities of our application simplified:
WITH subset AS (
    SELECT time, value, device_id
    FROM raw_data
    WHERE device_id IN (1, 2, 3)
      AND time BETWEEN '2019-01-01 00:00:00'::timestamp
                   AND '2019-01-15 00:00:00'::timestamp
)
SELECT "device_1".time,
       (("device_1".value + "device_2".value) / "device_3".value) AS value
FROM (SELECT * FROM subset WHERE device_id = 1) "device_1"
INNER JOIN (SELECT * FROM subset WHERE device_id = 2) "device_2"
    ON "device_1".time = "device_2".time
INNER JOIN (SELECT * FROM subset WHERE device_id = 3) "device_3"
    ON "device_3".time = "device_2".time
The query is auto-generated and can scale to a complex computation over the values of potentially tens of devices. For performance reasons, we would like to paginate the results of this query, since the time range used could be large. A key constraint is that the data could have gaps in time, but we want to return a constant number of rows per page.
We have considered appending LIMIT per_page OFFSET start to the end of the query, which would be the standard approach, but it buys us no speed-up: the query performs the same. This makes sense, because the LIMIT/OFFSET is applied only after all the data has been fetched, joined, and computed; it merely returns a slice of a result that has already been computed in full, so it does not appreciably reduce the query’s run time.
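Concretely, this approach just appends the paging clauses to the query above (the ORDER BY is our addition for stable page boundaries, and the literal values stand in for per_page and start):

ORDER BY "device_1".time  -- a deterministic sort order is needed for stable pages
LIMIT 100 OFFSET 200      -- e.g. page 3 at 100 rows per page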
We have considered paginating the data fetched into the CTE, i.e. computing which time range corresponds to the page of interest and then using that range in the BETWEEN clause of the CTE. This would work, but the problem is that we cannot reliably compute the time range, since some of the devices could have gaps in their data. For example, if we compute that 100 rows correspond to a window of 2 days and fetch 2 days, we may get fewer than 100 rows if device_2 did not record data at some point in that window, because those data points would be dropped by the INNER JOINs in the computation.
The question is, is there an efficient way to paginate this query or restructure it to enable fast pagination, given these constraints? For example, is there some way to instruct the query planner to “join until you match 100 results matching the join conditions, and stop there”. We are running this on PostgreSQL, if that makes a difference.
Answer
1) Create a composite index with the columns in this order: device_id first, then time descending.
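Assuming the table and column names from the question, that index would look something like this (the index name is just a placeholder):

CREATE INDEX idx_raw_data_device_time
    ON raw_data (device_id, time DESC);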
2) Try to generate the query in this form:
SELECT device_1.time,
       ((device_1.value + device_2.value) / device_3.value) AS value
FROM raw_data AS device_1,
     raw_data AS device_2,
     raw_data AS device_3
WHERE device_1.device_id = 1
  AND device_2.device_id = 2
  AND device_3.device_id = 3
  AND device_1.time BETWEEN '2019-01-01 00:00:00'::timestamp
                        AND '2019-01-15 00:00:00'::timestamp
  AND device_2.time BETWEEN '2019-01-01 00:00:00'::timestamp
                        AND '2019-01-15 00:00:00'::timestamp
  AND device_3.time BETWEEN '2019-01-01 00:00:00'::timestamp
                        AND '2019-01-15 00:00:00'::timestamp
  AND device_1.time = device_2.time
  AND device_2.time = device_3.time
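With the composite index in place, each alias can be resolved by an index scan on (device_id, time), so the planner can stop early once a LIMIT is satisfied instead of materializing the whole range. A sketch of how this shape could then be paginated (our assumption, not spelled out above): keyset pagination, where each page resumes from the last timestamp returned by the previous page rather than using OFFSET. The lower-bound timestamp here is a placeholder; for the first page, use the start of the overall range:

SELECT device_1.time,
       ((device_1.value + device_2.value) / device_3.value) AS value
FROM raw_data AS device_1,
     raw_data AS device_2,
     raw_data AS device_3
WHERE device_1.device_id = 1
  AND device_2.device_id = 2
  AND device_3.device_id = 3
  -- resume after the last timestamp of the previous page (placeholder value)
  AND device_1.time > '2019-01-03 00:00:00'::timestamp
  AND device_1.time <= '2019-01-15 00:00:00'::timestamp
  AND device_2.time BETWEEN '2019-01-01 00:00:00'::timestamp
                        AND '2019-01-15 00:00:00'::timestamp
  AND device_3.time BETWEEN '2019-01-01 00:00:00'::timestamp
                        AND '2019-01-15 00:00:00'::timestamp
  AND device_1.time = device_2.time
  AND device_2.time = device_3.time
ORDER BY device_1.time
LIMIT 100;

Because rows that fall out of the INNER JOINs simply never reach the LIMIT, each page still contains a constant number of fully joined rows regardless of gaps in any single device's data.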