Skip to content
Advertisement

Managing multiple categories trees, using Python and PostgreSQL

I have multiple categories, which can have None or one or multiple sub-categories.

The process theoretically can go to infinite. So, it is like having multiple trees.

Tree example.

A
 - A1
     - A11
     - A12
-A2
B
C
 - C1

I have also Item(s). An Item can be in multiple categories.

At this moment to connect the categories, in database I use three fields:

  • children (the children of a category),

  • path([1,4,8], basically are the ids of grandparent, parent, and category itself)

  • depth, represent the level in the tree for each category

Using this fields I avoid some recursive and using more queries.

I usually retrieve data like:

  • top categories (depth 0)

  • subcategories of a category

  • sibling categories

  • items in the category (for example a grandparent category, will show its direct items, the items of children, and the items of grandchildren)

At this moment I’m using Django(want to move to FastAPI) and PostgreSQL, and every time I have CRUD operations on categories, the three fields (path,depth,children)will be modified.

I’m thinking maybe is a better way, to maintain/retrieve the categories tree and corresponding items.

Advertisement

Answer

There are various possible strategies to store a tree in a database.

Storing full paths in an array as you currently is one of them. But with this solution is is hard to enforce referential integrity (how do you guarantee that these ids in the arrays really exist in the table?), and simple tree operations are tedious (how do you enumerate the direct children of given node?).

The answer by @VesaKarjalainen suggests using the adjacency list model, a separate table where each element refers to its immediate ancestor. It works, but has downsides: typically, is that complicated to traverse the hierarchy (like get all children or parents of a given node): you need some kind of iteration or recursion for this, which SQL engines do not do efficiently.

I would recommend the closure table approach. This works by creating a separate table that stores all possible paths in the tree, like so:

create table category_path (
    parent_id int,
    child_id int,
    level int,
    primary key(parent_id, child_id),
    foreign key(parent_id) references category(id),
    foreign key(parent_id) references category(id)
);

For this tree structure that you provided:

        A       B     C 
       /             |
     A1   A2          C1
     /
  A11  A12

You would store the following data:

parent_id    child_id    level
A            A           0
A            A1          1
A            A2          1
A            A11         2
A            A12         2
A1           A11         1
A1           A12         1
B            B           0
C            C           0
C            C1          1

Now, say you want to retrieve all children of a given category, that’s as simple as:

select * from category_path where parent_id = 'A'

To get all the parents, you would just replace where parent_id = ... with where child_id = ....

You can bring in the main table with a join:

select c.*
from category_path cp
inner join categories c on c.id = cp.parent_id
where cp.parent_id = 'A'
User contributions licensed under: CC BY-SA
8 People found this is helpful
Advertisement