So I have a bit of an unusual request. I’m working with a table with billions of rows.
The table has a column ‘id’ which is not unique, and has a column ‘data’
What I want to do is run a count on the number of rows grouped by the ‘id’, but limit the counting to only 150 entries. I only need to know if there are 150 rows by any given id.
This is in an effort to optimize the query and performance.
It doesn’t have to be a count. I only need to know if a given id has 150 entries, without have MySQL continue counting entries during the query. If that makes sense.
I know how to count, and I know how to group, and I know how to do both, but the count will come back with a number in the millions which is wasted processing time and the query needs to run on hundred of thousands of ids.
Advertisement
Answer
You can’t really optimize performance for this — I don’t think.
select id, (count(*) >= 150) from t group by id;
If you happen to have a separate table with one row per id and an index on t(id), then this might be faster:
select ids.id,
((select count(*)
from t
where t.id = ids.id
) >= 150
)
from ids;
Unfortunately, MySQL does not support double nesting for correlated subqueries, so this is not possible:
select ids.id,
((select count(*)
from (select 1
from t
where t.id = ids.id
limit 150
) t
) >= 150
)
from ids;
If so, this might be faster.
EDIT:
If you have an index on id and only want ids that have 150 or more, then variables might be faster:
select id,
(@rn := if(@id = id, @rn + 1,
if(@id := id, 1, 1)
)
) as rn
from (select id
from t
order by id
) t cross join
(select @id := 0, @rn := 0) params
having rn = 150;
The thinking here is that using the index to order the table, materializing, and scanning again is probably faster than group by. I don’t think row_number() would have the same performance characteristics.
EDIT II:
A slight variation on the above can be used to get all ids with a flag:
select id, (max(id) = 150)
from (select id,
(@rn := if(@id = id, @rn + 1,
if(@id := id, 1, 1)
)
) as rn
from (select id
from t
order by id
) t cross join
(select @id := 0, @rn := 0) params
having rn in (1, 150)
) t
group by id;
EDIT III:
Ahh! If you have a separate table of ids, then this might be the best approach:
select ids.id,
(select id
from t
where t.id = ids.id
limit 1 offset 149
) is not null
from ids;
This will fetch the 150th row from the index. If it not there, then no row is returned.