I am looking to aggregate IDs from a table by chunks from their reference from an other table. It is a bit hard to summary my problem so I will give an example:
I have two tables, the table Box and the table Item.
CREATE TABLE box( id bigint NOT NULL, label varchar, CONSTRAINT box_pk PRIMARY KEY (id)); CREATE TABLE item( id bigint NOT NULL, box bigint NOT NULL, label varchar, CONSTRAINT item_pk PRIMARY KEY (id), CONSTRAINT box_fk FOREIGN KEY (box) REFERENCES box(id));
There is a many to one reference between them, a box can contains many items, and an item cannot exist without a box.
Currently there are a lot of boxes (> 100,000) and items (> 600,000), and even if most of boxes have around 10 items, a significant amount have more than 1,000 items.
I need to do a specific process on the items, where I have to compare an item to all other items from the same box (with a Java code). In order to avoid selecting a lot of items as once, I want to try to regroup all boxes IDs in a single cell (separated with a coma) that satisfies a certain chunk size, this chunk equate to the max amount of items for this group of boxes.
The only thing that I managed to do is a request counting the amount of items by boxes:
SELECT b.id, count(i.*) as items FROM box b LEFT JOIN item i ON i.box = b.id WHERE i.box IS NOT NULL GROUP BY b.id ORDER BY items DESC id | items 3834 | 7206 78350| 6151 73525| 5996 3838 | 5192 71331| 5184 76842| 3982 76854| 3982 ...
The result I want would look like this, if I set a chunk of items to 15000 for example. id_group would be a text column.
id_group | total_amount 3834,78350 | 13357 73525,3838 | 11188 71331,76842,76854 | 13148
At the beginning there will not be a lot of IDs, but with less items in the latter boxes, there will be more and more IDs in each cell to reach the chunk limit, and it is what I want! If for any reason there is a box containing more items than the chunk limit, then it would just return this single ID in the cell. I don’t need the total_amount though, I just need the IDs of the boxes joined by a comma, then I will be able to do my processes.
Is there a way with postgreSQL to do this?
Advertisement
Answer
You can implement a greedy algorithm for combining the boxes using a recursive CTE:
with recursive b as ( select b.id, count(*) as items, row_number() over (order by count(*), b.id) as seqnum from box b join item i on i.box = b.id group by b.id ), cte as ( select b.id::text as ids, b.items as items, 1 as grp, 1 as seqnum from b where seqnum = 1 union all select (case when b.items + cte.items < 15000 then cte.ids || ',' || b.id else b.id::text end) as ids, (case when b.items + cte.items < 15000 then cte.items + b.items else b.items end) as items, (case when b.items + cte.items < 15000 then cte.grp else cte.grp + 1 end) as grp, b.seqnum from cte join b on b.seqnum = cte.seqnum + 1 ) select distinct on (grp) cte.* from cte order by grp, seqnum desc;
Here is a db<>fiddle.