
Best practice for scaling SQL queries on joins?

I’m writing a REST API backed by SQL, and I constantly find myself in situations like this one, where I need to return lists of objects, each with nested lists inside, by querying across table joins.

Let’s say I have a many-to-many relationship between Users and Groups: a User table, a Group table, and a junction table UserGroup between them. Now I want to write a REST endpoint that returns a list of users and, for each user, the groups they are enrolled in. I want to return JSON in a format like this:

[
    {
        "username": "test_user1",
        <other attributes ...>
        "groups": [
            {
                "group_id": 2,
                <other attributes ...>
            },
            {
                "group_id": 3,
                <other attributes ...>
            }
        ]
    },
    {
        "username": "test_user2",
        <other attributes ...>
        "groups": [
            {
                "group_id": 1,
                <other attributes ...>
            },
            {
                "group_id": 2,
                <other attributes ...>
            }
        ]
    },
    etc ...
]

There are two or three ways to query SQL for this that I can think of:

  1. Issue a variable number of SQL queries: Query for a list of Users, then loop over each user, querying across the junction table to populate that user’s groups list. The number of SQL queries grows linearly with the number of users returned.

Example (using Python flask_sqlalchemy / flask_restx):

users = db.session.query(User).filter( ... ).all()
result = []
for u in users:
    groups = (
        db.session.query(Group)
        .join(UserGroup, UserGroup.group_id == Group.id)
        .filter(UserGroup.user_id == u.id)
        .all()
    )
    result.append({**u.__dict__, 'groups': groups})
retobj = api.marshal(result, my_model)
# Total number of queries: 1 + number of users in the result
  2. Issue a constant number of SQL queries: This can be done with one monolithic SQL query that performs all the joins (duplicating potentially a lot of redundant data across the User columns), or, often preferably, a few separate SQL queries. For example, query for a list of Users, then query the Group table joining on UserGroup, and group the results by user in server code.

Example code:

from collections import defaultdict

users = db.session.query(User).filter( ... ).all()
uids = [u.id for u in users]
rows = (
    db.session.query(UserGroup.user_id, Group)
    .join(Group, UserGroup.group_id == Group.id)
    .filter(UserGroup.user_id.in_(uids))
    .all()
)
aggregate = defaultdict(list)
for user_id, group in rows:
    aggregate[user_id].append(group.__dict__)
retobj = api.marshal([{**u.__dict__, 'groups': aggregate[u.id]} for u in users], my_model)
# Total number of queries: 2
  3. Use string_agg or a similar aggregate, which has limited usefulness: force SQL to concatenate each grouping into a single string column, then unpack the string into a list server-side. For example, if all I wanted was the group id, I could use string_agg with GROUP BY to get back "1,2" in one query against the User table. But this is only useful if you don’t need complex objects (see the sketch after this list).
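For concreteness, here is a minimal sketch of the string_agg approach in SQLAlchemy. It assumes PostgreSQL (MySQL and SQLite use group_concat instead) and the model names from above; the group_ids label and the column types are my assumptions:

from sqlalchemy import Text, cast, func

# One query: each row is a User plus a "2,3,..." string of that user's group ids.
# string_agg is PostgreSQL-specific; swap in func.group_concat for MySQL/SQLite.
rows = (
    db.session.query(
        User,
        func.string_agg(cast(UserGroup.group_id, Text), ',').label('group_ids'),
    )
    .outerjoin(UserGroup, UserGroup.user_id == User.id)
    .group_by(User.id)  # PostgreSQL allows selecting User's columns when grouping by its primary key
    .all()
)
result = [
    {**u.__dict__, 'group_ids': [int(g) for g in ids.split(',')] if ids else []}
    for u, ids in rows
]
# Total number of queries: 1, but only scalar group attributes can be recovered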

I’m attracted to the second approach because I feel it’s more efficient and more scalable: the number of SQL queries (which I have assumed is the main bottleneck, for no particularly good reason) stays constant, although it takes more work on the server side to sort the groups into each user. But I thought part of the point of using SQL was to take advantage of its efficient sorting/filtering so you don’t have to do it yourself.

So my question is: am I right in thinking that it’s a good idea to keep the number of SQL queries constant at the expense of more server-side processing and dev time? Is it a waste of time to try to whittle down the number of unnecessary SQL queries? Will I regret it if I don’t, once the API is tested at scale? Is there a better way to solve this problem that I’m not aware of?


Answer

Using the joinedload option, you can load all the data with just one query:

from sqlalchemy.orm import joinedload

# Assumes the model defines the many-to-many relationship, e.g.:
#   groups = db.relationship('Group', secondary='user_group')
q = (
    db.session.query(User)
    .options(joinedload(User.groups))
    .order_by(User.id)
)
users = q.all()
for user in users:
    print(user.name)
    for group in user.groups:
        print("  ", group.name)

When you run the query above, all the groups will already have been loaded from the database, using a query similar to the one below:

SELECT "user".id,
       "user".name,
       group_1.id,
       group_1.name
FROM   "user"
LEFT OUTER JOIN (user_group AS user_group_1
                 JOIN "group" AS group_1 ON group_1.id = user_group_1.group_id)
            ON  "user".id = user_group_1.user_id

Now you only need to serialize the result with the proper schema.
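For instance, a minimal flask_restx schema sketch matching the JSON shape in the question; the field names beyond username and group_id, and the exact attribute mappings, are assumptions:

from flask_restx import fields, marshal

# Hypothetical schema; add the <other attributes ...> fields as needed.
group_model = api.model('Group', {
    'group_id': fields.Integer(attribute='id'),
})
user_model = api.model('User', {
    'username': fields.String,
    'groups': fields.List(fields.Nested(group_model)),
})

retobj = marshal(users, user_model)  # users from the joinedload query above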

User contributions licensed under: CC BY-SA