Skip to content
Advertisement

SQL query to count rows grouped by an ID, but limit count on each group

So I have a bit of an unusual request. I’m working with a table with billions of rows.

The table has a column ‘id’ which is not unique, and has a column ‘data’

What I want to do is run a count on the number of rows grouped by the ‘id’, but limit the counting to only 150 entries. I only need to know if there are 150 rows by any given id.

This is in an effort to optimize the query and performance.

It doesn’t have to be a count. I only need to know if a given id has 150 entries, without have MySQL continue counting entries during the query. If that makes sense.

I know how to count, and I know how to group, and I know how to do both, but the count will come back with a number in the millions which is wasted processing time and the query needs to run on hundred of thousands of ids.

Advertisement

Answer

You can’t really optimize performance for this — I don’t think.

select id, (count(*) >= 150)
from t
group by id;

If you happen to have a separate table with one row per id and an index on t(id), then this might be faster:

select ids.id,
       ((select count(*)
         from t
         where t.id = ids.id
        ) >= 150
       )
from ids;

Unfortunately, MySQL does not support double nesting for correlated subqueries, so this is not possible:

select ids.id,
       ((select count(*)
         from (select 1
               from t
               where t.id = ids.id
               limit 150
              ) t
        ) >= 150
       )
from ids;

If so, this might be faster.

EDIT:

If you have an index on id and only want ids that have 150 or more, then variables might be faster:

select id,
       (@rn := if(@id = id, @rn + 1,
                  if(@id := id, 1, 1)
                 )
       ) as rn
from (select id
      from t
      order by id
     ) t cross join
     (select @id := 0, @rn := 0) params
having rn = 150;

The thinking here is that using the index to order the table, materializing, and scanning again is probably faster than group by. I don’t think row_number() would have the same performance characteristics.

EDIT II:

A slight variation on the above can be used to get all ids with a flag:

select id, (max(id) = 150)
from (select id,
             (@rn := if(@id = id, @rn + 1,
                        if(@id := id, 1, 1)
                       )
             ) as rn
      from (select id
            from t
            order by id
           ) t cross join
           (select @id := 0, @rn := 0) params
      having rn in (1, 150)
     ) t
group by id;

EDIT III:

Ahh! If you have a separate table of ids, then this might be the best approach:

select ids.id,
       (select id
        from t
        where t.id = ids.id
        limit 1 offset 149
       ) is not null
from ids;

This will fetch the 150th row from the index. If it not there, then no row is returned.

User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement