I am looking to aggregate IDs from a table by chunks from their reference from an other table. It is a bit hard to summary my problem so I will give an example:
I have two tables, the table Box and the table Item.
CREATE TABLE box(
id bigint NOT NULL,
label varchar,
CONSTRAINT box_pk PRIMARY KEY (id));
CREATE TABLE item(
id bigint NOT NULL,
box bigint NOT NULL,
label varchar,
CONSTRAINT item_pk PRIMARY KEY (id),
CONSTRAINT box_fk FOREIGN KEY (box) REFERENCES box(id));
There is a many to one reference between them, a box can contains many items, and an item cannot exist without a box.
Currently there are a lot of boxes (> 100,000) and items (> 600,000), and even if most of boxes have around 10 items, a significant amount have more than 1,000 items.
I need to do a specific process on the items, where I have to compare an item to all other items from the same box (with a Java code). In order to avoid selecting a lot of items as once, I want to try to regroup all boxes IDs in a single cell (separated with a coma) that satisfies a certain chunk size, this chunk equate to the max amount of items for this group of boxes.
The only thing that I managed to do is a request counting the amount of items by boxes:
SELECT b.id, count(i.*) as items
FROM box b LEFT JOIN item i ON i.box = b.id
WHERE i.box IS NOT NULL
GROUP BY b.id
ORDER BY items DESC
id | items
3834 | 7206
78350| 6151
73525| 5996
3838 | 5192
71331| 5184
76842| 3982
76854| 3982
The result I want would look like this, if I set a chunk of items to 15000 for example. id_group would be a text column.
id_group | total_amount
3834,78350 | 13357
73525,3838 | 11188
71331,76842,76854 | 13148
At the beginning there will not be a lot of IDs, but with less items in the latter boxes, there will be more and more IDs in each cell to reach the chunk limit, and it is what I want! If for any reason there is a box containing more items than the chunk limit, then it would just return this single ID in the cell. I don’t need the total_amount though, I just need the IDs of the boxes joined by a comma, then I will be able to do my processes.
Is there a way with postgreSQL to do this?
Advertisement
Answer
You can implement a greedy algorithm for combining the boxes using a recursive CTE:
with recursive b as (
select b.id, count(*) as items,
row_number() over (order by count(*), b.id) as seqnum
from box b join
item i
on i.box = b.id
group by b.id
),
cte as (
select b.id::text as ids, b.items as items, 1 as grp, 1 as seqnum
from b
where seqnum = 1
union all
select (case when b.items + cte.items < 15000
then cte.ids || ',' || b.id
else b.id::text
end) as ids,
(case when b.items + cte.items < 15000
then cte.items + b.items
else b.items
end) as items,
(case when b.items + cte.items < 15000
then cte.grp
else cte.grp + 1
end) as grp,
b.seqnum
from cte join
b
on b.seqnum = cte.seqnum + 1
)
select distinct on (grp) cte.*
from cte
order by grp, seqnum desc;
Here is a db<>fiddle.