How can I replace this correlated subquery within a function call?

Given the following tables

buckets

metric_id|start_date         |bucket
------------------------------------
a        |2019-12-05 00:00:00|1
a        |2019-12-06 00:00:00|2
b        |2021-10-31 00:00:00|1
b        |2021-11-01 00:00:00|2

​x
 
metric_id|start_date         |bucket------------------------------------a        |2019-12-05 00:00:00|1a        |2019-12-06 00:00:00|2b        |2021-10-31 00:00:00|1b        |2021-11-01 00:00:00|2​

points

point_id|metric_id|timestamp
----------------------------
1       |a        |2019-12-05 00:00:00
2       |a        |2019-12-06 00:00:00
3       |b        |2021-10-31 00:00:00
4       |b        |2021-11-01 00:00:00

 
point_id|metric_id|timestamp----------------------------1       |a        |2019-12-05 00:00:002       |a        |2019-12-06 00:00:003       |b        |2021-10-31 00:00:004       |b        |2021-11-01 00:00:00​

And the following query

select
       p.metric_id,
       bucket
from points p
left join width_bucket(p.timestamp, (select array(select start_date
                                                  from buckets b
                                                  where b.metric_id = p.metric_id -- correlated sub-query
                                                  ))) as bucket on true

 
select       p.metric_id,       bucketfrom points pleft join width_bucket(p.timestamp, (select array(select start_date                                                  from buckets b                                                  where b.metric_id = p.metric_id -- correlated sub-query                                                  ))) as bucket on true​

Output

metric_id|bucket
-----------------
a        |1
a        |2
b        |1
b        |2

 
metric_id|bucket-----------------a        |1a        |2b        |1b        |2​

How can I remove the correlated sub-query to improve the performance?

Currently ~280,000 points * ~650 buckets = ~180,000,000 loops = very slow!

Basically I want to remove the correlated sub-query and apply the width_bucket function only once per unique metric_id in buckets, so that the performance is improved and the function is still given the correct time series data.

How can this be done in Postgres 13?

Answer

You can use a cte to aggregate buckets first

with buckets_arr as (
   select  metric_id, array_agg(start_date order by start_date) arrb
   from buckets
   group by metric_id
)
select
       p.metric_id,
       width_bucket(p.timestamp, ba.arrb) bucket
from points p
join buckets_arr ba on p.metric_id = ba.metric_id

 
with buckets_arr as (   select  metric_id, array_agg(start_date order by start_date) arrb   from buckets   group by metric_id)select       p.metric_id,       width_bucket(p.timestamp, ba.arrb) bucketfrom points pjoin buckets_arr ba on p.metric_id = ba.metric_id​

Advertisement

Answer