Skip to content
Advertisement

COUNT with GROUP BY based on most recent rows only

I have a table named user_teams which has the following columns:

id: primary key
user_id: FK to users table
team_id: FK to teams table
effective_date: Date

I want to have a query that given a set of team ids, it gives me the count of users that currently belong to that team. The currently part is important and that’s why the effective_date exists. For example, if user with id 1 first belonged to team with id 1, and later belonged to team with id 2, there would be two rows:

id: 1
user_id: 1
team_id: 1
effective_date: '2020-01-01'


id: 2
user_id: 1
team_id: 2
effective_date: '2020-02-01'

And this user should just be counted for the team with id 2. I have the following query that accomplishes what I’m looking for, but I don’t like it because the subselect will basically go through the whole table:

SELECT ut.team_id, COUNT(*) as count
  FROM (
    SELECT DISTINCT ON (ut.user_id) ut.*
    FROM user_teams ut
    ORDER BY ut.user_id, ujpjl.effective_date DESC
  ) ut
WHERE ut.team_id IN (<PASSED_TEAM_IDS>)

I can’t move the WHERE inside the subselect, because if for example I were to use the previous example and I run the query for the team_id 1:

 SELECT ut.team_id, COUNT(*) as count
      FROM (
        SELECT DISTINCT ON (ut.user_id) ut.*
        FROM user_teams ut
        WHERE ut.team_id IN (1)
        ORDER BY ut.user_id, ujpjl.effective_date DESC
      ) ut

It would consider the user that at one point it was in team 1, but it no longer is.

Is there another way to make this query so that my subselect doesn’t go through all the user_teams table?

Advertisement

Answer

If you want to avoid looking at every user in the table, follow these steps:

  1. Find all users that play or played in the requested teams.
  2. Find those users’ latest entries.
  3. Determine those entries’ teams.
  4. Only keep desired teams and count.

The query:

select team_id, count(*)
from
(
  select
    team_id,
    row_number() over (partition by user_id order by effective_date desc) as rn
  from user_teams
  where user_id in
  (
    select user_id
    from user_teams
    where team_id in (1,2,3)
  )
) ranked
where rn = 1 and team_id in (1,2,3)
group by team_id
order by team_id;

Indexes:

create index idx1 on user_teams (team_id, user_id);
create index idx2 on user_teams (user_id, effective_date, team_id);

Anyway, working this way makes sense when you have, say, 10000 users with their team history in the table, but a team has just five or ten users. This means working on a small subset of the table data. Once the ratio is less extreme, it may be quicker to simply go through the whole table, i.e. use your own query. This could still benefit from the second index, as it contains all data in the appropriate order (per user -> highest date -> team).

User contributions licensed under: CC BY-SA
8 People found this is helpful
Advertisement