So I have a bit of an unusual request. I’m working with a table with billions of rows.
The table has a column ‘id’ which is not unique, and has a column ‘data’
What I want to do is run a count on the number of rows grouped by the ‘id’, but limit the counting to only 150 entries. I only need to know if there are 150 rows by any given id.
This is in an effort to optimize the query and performance.
It doesn’t have to be a count. I only need to know if a given id has 150 entries, without have MySQL continue counting entries during the query. If that makes sense.
I know how to count, and I know how to group, and I know how to do both, but the count will come back with a number in the millions which is wasted processing time and the query needs to run on hundred of thousands of ids.
Advertisement
Answer
You can’t really optimize performance for this — I don’t think.
select id, (count(*) >= 150) from t group by id;
If you happen to have a separate table with one row per id and an index on t(id)
, then this might be faster:
select ids.id, ((select count(*) from t where t.id = ids.id ) >= 150 ) from ids;
Unfortunately, MySQL does not support double nesting for correlated subqueries, so this is not possible:
select ids.id, ((select count(*) from (select 1 from t where t.id = ids.id limit 150 ) t ) >= 150 ) from ids;
If so, this might be faster.
EDIT:
If you have an index on id
and only want ids that have 150 or more, then variables might be faster:
select id, (@rn := if(@id = id, @rn + 1, if(@id := id, 1, 1) ) ) as rn from (select id from t order by id ) t cross join (select @id := 0, @rn := 0) params having rn = 150;
The thinking here is that using the index to order the table, materializing, and scanning again is probably faster than group by
. I don’t think row_number()
would have the same performance characteristics.
EDIT II:
A slight variation on the above can be used to get all ids with a flag:
select id, (max(id) = 150) from (select id, (@rn := if(@id = id, @rn + 1, if(@id := id, 1, 1) ) ) as rn from (select id from t order by id ) t cross join (select @id := 0, @rn := 0) params having rn in (1, 150) ) t group by id;
EDIT III:
Ahh! If you have a separate table of ids, then this might be the best approach:
select ids.id, (select id from t where t.id = ids.id limit 1 offset 149 ) is not null from ids;
This will fetch the 150th row from the index. If it not there, then no row is returned.