select MAX(id) from studenthistory where class_id = 1 and date(created_at) = '2021-11-05' and time(created_at) > TIME('04:00:00') group by student_id composite indexes = ("class_id", "student_id", "created_at")
id is the primary key.
Is date(created_at) = '2021-11-05' and time(created_at) > TIME('04:00:00')
filter condition unnecessary for Max function since studenthistory is already indexed on class_id and student_id?
The only reason I added that datetime filter is because this table will get huge over time. (historical data) And I wanted to reduce the number of rows the query has to search.
But for the case of Max function – I believe MAX would simply fetch the last value without checking the whole row, if it is indexed.
So can i safely remove the datetime filter and turn it into
select MAX(id) from studenthistory where class_id = 1 group by student_id
And have the same performance? (or better since it does not need to filter further?)
Checking the query plan seems like the performance is similar, but the size of the table is rather small as of now..
First:
| -> Group aggregate: max(id) (cost=1466.30 rows=7254) (actual time=2.555..5.766 rows=3 loops=1) -> Filter: ((cast(studenthistory.created_at as date) = '2021-11-05') and (cast(riderlocation.created_at as time(6)) > <cache>(cast('04:00:00' as time)))) (cost=740.90 rows=7254) (actual time=0.762..5.384 rows=5349 loops=1) -> Index lookup on studenthistory using idx_studenthistory_class_id_931474 (class_id=1) (cost=740.90 rows=7254) (actual time=0.029..3.589 rows=14638 loops=1) | 1 row in set (0.00 sec)
Second:
| -> Group aggregate: max(studenthistory.id) (cost=1475.40 rows=7299) (actual time=0.545..5.271 rows=10 loops=1) -> Index lookup on studenthistory using idx_studenthistory_class_id_931474 (class_id=1) (cost=745.50 rows=7299) (actual time=0.026..4.164 rows=14729 loops=1) | 1 row in set (0.01 sec)
Many thanks in advance
UPDATE: applying @rick james’s suggestion:
Changed index to (class_id, student_id, id).
FLUSH STATUS; explain FORMAT=JSON SELECT MAX(`id`) `0` FROM `studenthistory` WHERE `class_id`=1 AND `created_at`>='2021-11-05T18:25:50.544850+00:00' GROUP BY `student_id`; | { "query_block": { "select_id": 1, "cost_info": { "query_cost": "940.10" }, "grouping_operation": { "using_filesort": false, "table": { "table_name": "studenthistory", "access_type": "ref", "possible_keys": [ "fk_studenthist_student_e25b0310", "idx_studenthistory_class_id_931474" ], "key": "idx_studenthistory_class_id_931474", "used_key_parts": [ "class_id" ], "key_length": "4", "ref": [ "const" ], "rows_examined_per_scan": 8381, "rows_produced_per_join": 2793, "filtered": "33.33", "cost_info": { "read_cost": "102.00", "eval_cost": "279.34", "prefix_cost": "940.10", "data_read_per_join": "130K" }, "used_columns": [ "id", "created_at", "student_id", "class_id" ], "attached_condition": "(`test-table`.`studenthistory`.`created_at` >= TIMESTAMP'2021-11-05 18:25:50.54485')" } } } } |
i.e. only class_id is used as an index, (as created_at is no longer in the index. rows_produced_per_join is lower due to filter: 2793,
Without datetime filter:
FLUSH STATUS; mysql> explain FORMAT=JSON SELECT MAX(`id`) `0` FROM `studenthistory` WHERE `class_id`=1 GROUP BY `student_id`; | { "query_block": { "select_id": 1, "cost_info": { "query_cost": "854.75" }, "grouping_operation": { "using_filesort": false, "table": { "table_name": "studenthistory", "access_type": "ref", "possible_keys": [ "fk_studenthistory_student_e25b0310", "idx_studenthistory_class_id_931474" ], "key": "idx_studenthistory_class_id_931474", "used_key_parts": [ "class_id" ], "key_length": "4", "ref": [ "const" ], "rows_examined_per_scan": 8381, "rows_produced_per_join": 8381, "filtered": "100.00", "using_index": true, "cost_info": { "read_cost": "16.65", "eval_cost": "838.10", "prefix_cost": "854.75", "data_read_per_join": "392K" }, "used_columns": [ "id", "student_id", "class_id" ] } } } } |
Runs on all 3 indexes (“class_id”, “student_id”, “id”), same 8381 number of rows slightly lower query cost (940 -> 854)
Applying the first query with original index (“class_id”, “student_id”, “created_at”) yields:
FLUSH STATUS; explain FORMAT=JSON SELECT MAX(`id`) `0` FROM `studenthistory` WHERE `class_id`=1 AND `created_at`>='2021-11-05T18:25:50.544850+00:00' GROUP BY `student_id`; | { "query_block": { "select_id": 1, "cost_info": { "query_cost": "858.94" }, "grouping_operation": { "using_filesort": false, "table": { "table_name": "studenthistory", "access_type": "ref", "possible_keys": [ "fk_studenthistory_student_e25b0310", "idx_studenthistory_class_id_931474" ], "key": "idx_studenthistory_class_id_931474", "used_key_parts": [ "class_id" ], "key_length": "4", "ref": [ "const" ], "rows_examined_per_scan": 8381, "rows_produced_per_join": 2793, "filtered": "33.33", "using_index": true, "cost_info": { "read_cost": "20.84", "eval_cost": "279.34", "prefix_cost": "858.94", "data_read_per_join": "130K" }, "used_columns": [ "id", "created_at", "student_id", "class_id" ], "attached_condition": "(`test-table`.`studenthistory`.`created_at` >= TIMESTAMP'2021-11-05 18:25:50.54485')" } } } } |
The cost this time is 858, rows “rows_examined_per_scan”: 8381, “rows_produced_per_join”: 2793. Only class_id was used as key however. (why.?) not the remaining student_id and created_at
Advertisement
Answer
Query 1
select MAX(id) from studenthistory where class_id = 1 and date(created_at) = '2021-11-05' and time(created_at) > TIME('04:00:00') group by student_id
Don’t split up the date; change to
AND created_at > '2021-11-05 04:00:00'
If you want to check rows that were ‘created’ on the day, use something
AND created_at >= '2021-11-05' AND created_at < '2021-11-05' + INTERVAL 1 DAY
Or, if you want to check for “today”:
AND created_at >= CURDATE()
After 4am this morning:
AND created_at >= CURDATE() + INTERVAL 4 HOUR
Using date(created_at)
makes the created_at
part of the INDEX
unusable. (cf “sargable”)
select MAX(id) ... group by student_id
Is likely to return multiple rows — one per student. Perhaps you want to get rid of the group by
? Or specify a particular student_id
?
Query 2 may run faster:
select MAX(id) from studenthistory where class_id = 1 group by student_id
But the optimal index is INDEX(class_id, student_id, id)
, (It is OK to include both composite indexes.)
It may return multiple rows, so perhaps you want
select student_id, MAX(id) from studenthistory where class_id = 1 group by student_id
MAX
I believe MAX would simply fetch the last value without checking the whole row, if it is indexed.
Sometimes.
Your second query can do that. But the first query cannot — because of the range test (on created_at) being in the way.
EXPLAIN
query plan seems … similar
Alas, EXPLAIN
leaves details out. You can get some more details with EXPLAIN FORMAT=JSON SELECT ...
, but not necessarily enough details.
I think you will find that the second query will give a much smaller value for “Rows” after adding my suggested index.
A way to get an accurate measure of “rows (table or index) touched”:
FLUSH STATUS; SELECT ...; SHOW SESSION STATUS LIKE 'Handler%';
Sensor data
For sensor data, consider multiple tables:
- The raw data (“Fact” table, in Data Warehouse terminology). This has one row per reading per sensor.
- The latest value for each sensor. This has one row for each of the 90K sensors. It will be a lot easier to maintain this table than to “find the latest” value for each sensor; that’s a “groupwise-maximum” problem.
- Summary data. An example is to have high/low/average/etc values for each sensor. This has one row per hour (or day or whatever is useful) per sensor.