Skip to content
Advertisement

Group objects by a sum of their references with a max limit

I am looking to aggregate IDs from a table by chunks from their reference from an other table. It is a bit hard to summary my problem so I will give an example:

I have two tables, the table Box and the table Item.

CREATE TABLE box(
id bigint NOT NULL,
label varchar,
CONSTRAINT box_pk PRIMARY KEY (id));

CREATE TABLE item(
id bigint NOT NULL,
box bigint NOT NULL,
label varchar,
CONSTRAINT item_pk PRIMARY KEY (id),
CONSTRAINT box_fk FOREIGN KEY (box) REFERENCES box(id));

There is a many to one reference between them, a box can contains many items, and an item cannot exist without a box.

Currently there are a lot of boxes (> 100,000) and items (> 600,000), and even if most of boxes have around 10 items, a significant amount have more than 1,000 items.

I need to do a specific process on the items, where I have to compare an item to all other items from the same box (with a Java code). In order to avoid selecting a lot of items as once, I want to try to regroup all boxes IDs in a single cell (separated with a coma) that satisfies a certain chunk size, this chunk equate to the max amount of items for this group of boxes.

The only thing that I managed to do is a request counting the amount of items by boxes:

SELECT b.id, count(i.*) as items 
FROM box b LEFT JOIN item i ON i.box = b.id 
WHERE i.box IS NOT NULL 
GROUP BY b.id 
ORDER BY items DESC

id   | items
3834 | 7206
78350| 6151
73525| 5996
3838 | 5192
71331| 5184
76842| 3982
76854| 3982
...

The result I want would look like this, if I set a chunk of items to 15000 for example. id_group would be a text column.

id_group          | total_amount
3834,78350        | 13357
73525,3838        | 11188
71331,76842,76854 | 13148

At the beginning there will not be a lot of IDs, but with less items in the latter boxes, there will be more and more IDs in each cell to reach the chunk limit, and it is what I want! If for any reason there is a box containing more items than the chunk limit, then it would just return this single ID in the cell. I don’t need the total_amount though, I just need the IDs of the boxes joined by a comma, then I will be able to do my processes.

Is there a way with postgreSQL to do this?

Advertisement

Answer

You can implement a greedy algorithm for combining the boxes using a recursive CTE:

with recursive b as (
      select b.id, count(*) as items,
             row_number() over (order by count(*), b.id) as seqnum
      from box b join
           item i 
           on i.box = b.id 
      group by b.id 
     ),
     cte as (
      select b.id::text as ids, b.items as items, 1 as grp, 1 as seqnum
      from b
      where seqnum = 1
      union all
      select (case when b.items + cte.items < 15000
                   then cte.ids || ',' || b.id
                   else b.id::text
              end) as ids,
             (case when b.items + cte.items < 15000
                   then cte.items + b.items
                   else b.items
              end) as items,
             (case when b.items + cte.items < 15000
                   then cte.grp
                   else cte.grp + 1
              end) as grp,
             b.seqnum
      from cte join
           b
           on b.seqnum = cte.seqnum + 1
     )
select distinct on (grp) cte.*
from cte
order by grp, seqnum desc;

Here is a db<>fiddle.

User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement