I have a table named user_teams which has the following columns:
id: primary key user_id: FK to users table team_id: FK to teams table effective_date: Date
I want to have a query that given a set of team ids, it gives me the count of users that currently belong to that team. The currently part is important and that’s why the effective_date
exists. For example, if user with id 1 first belonged to team with id 1, and later belonged to team with id 2, there would be two rows:
id: 1 user_id: 1 team_id: 1 effective_date: '2020-01-01' id: 2 user_id: 1 team_id: 2 effective_date: '2020-02-01'
And this user should just be counted for the team with id 2. I have the following query that accomplishes what I’m looking for, but I don’t like it because the subselect will basically go through the whole table:
SELECT ut.team_id, COUNT(*) as count FROM ( SELECT DISTINCT ON (ut.user_id) ut.* FROM user_teams ut ORDER BY ut.user_id, ujpjl.effective_date DESC ) ut WHERE ut.team_id IN (<PASSED_TEAM_IDS>)
I can’t move the WHERE inside the subselect, because if for example I were to use the previous example and I run the query for the team_id 1:
SELECT ut.team_id, COUNT(*) as count FROM ( SELECT DISTINCT ON (ut.user_id) ut.* FROM user_teams ut WHERE ut.team_id IN (1) ORDER BY ut.user_id, ujpjl.effective_date DESC ) ut
It would consider the user that at one point it was in team 1, but it no longer is.
Is there another way to make this query so that my subselect doesn’t go through all the user_teams table?
Advertisement
Answer
If you want to avoid looking at every user in the table, follow these steps:
- Find all users that play or played in the requested teams.
- Find those users’ latest entries.
- Determine those entries’ teams.
- Only keep desired teams and count.
The query:
select team_id, count(*) from ( select team_id, row_number() over (partition by user_id order by effective_date desc) as rn from user_teams where user_id in ( select user_id from user_teams where team_id in (1,2,3) ) ) ranked where rn = 1 and team_id in (1,2,3) group by team_id order by team_id;
Indexes:
create index idx1 on user_teams (team_id, user_id); create index idx2 on user_teams (user_id, effective_date, team_id);
Anyway, working this way makes sense when you have, say, 10000 users with their team history in the table, but a team has just five or ten users. This means working on a small subset of the table data. Once the ratio is less extreme, it may be quicker to simply go through the whole table, i.e. use your own query. This could still benefit from the second index, as it contains all data in the appropriate order (per user -> highest date -> team).