
SQL MAX on primary key, is filter condition unnecessary if it is already indexed?

select MAX(id) from studenthistory 
where class_id = 1 
   and date(created_at) = '2021-11-05' 
   and time(created_at) > TIME('04:00:00') 
group by student_id

Composite index: (class_id, student_id, created_at)

id is the primary key.

Is the date(created_at) = '2021-11-05' and time(created_at) > TIME('04:00:00') filter condition unnecessary for the MAX function, since studenthistory is already indexed on class_id and student_id?

The only reason I added that datetime filter is that this table will get huge over time (it holds historical data), and I wanted to reduce the number of rows the query has to scan.

But in the case of the MAX function, I believe MAX would simply fetch the last value without checking the whole row, as long as the column is indexed.

So can I safely remove the datetime filter and turn it into:

select MAX(id) from studenthistory 
where class_id = 1 
group by student_id

And get the same performance? (Or better, since it does not need to filter further?)

Checking the query plan, the performance seems similar, but the table is rather small as of now.

First:

| -> Group aggregate: max(id)  (cost=1466.30 rows=7254) (actual time=2.555..5.766 rows=3 loops=1)
    -> Filter: ((cast(studenthistory.created_at as date) = '2021-11-05') and (cast(studenthistory.created_at as time(6)) > <cache>(cast('04:00:00' as time))))  (cost=740.90 rows=7254) (actual time=0.762..5.384 rows=5349 loops=1)
        -> Index lookup on studenthistory using idx_studenthistory_class_id_931474 (class_id=1)  (cost=740.90 rows=7254) (actual time=0.029..3.589 rows=14638 loops=1)
 |

1 row in set (0.00 sec)

Second:

| -> Group aggregate: max(studenthistory.id)  (cost=1475.40 rows=7299) (actual time=0.545..5.271 rows=10 loops=1)
    -> Index lookup on studenthistory using idx_studenthistory_class_id_931474 (class_id=1)  (cost=745.50 rows=7299) (actual time=0.026..4.164 rows=14729 loops=1)
 |
1 row in set (0.01 sec)

Many thanks in advance


UPDATE: applying @Rick James's suggestion:

Changed index to (class_id, student_id, id).

FLUSH STATUS;
explain FORMAT=JSON SELECT MAX(`id`) `0` FROM `studenthistory`
    WHERE `class_id`=1 AND `created_at`>='2021-11-05T18:25:50.544850+00:00'
    GROUP BY `student_id`;


| {
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "940.10"
    },
    "grouping_operation": {
      "using_filesort": false,
      "table": {
        "table_name": "studenthistory",
        "access_type": "ref",
        "possible_keys": [
          "fk_studenthist_student_e25b0310",
          "idx_studenthistory_class_id_931474"
        ],
        "key": "idx_studenthistory_class_id_931474",
        "used_key_parts": [
          "class_id"
        ],
        "key_length": "4",
        "ref": [
          "const"
        ],
        "rows_examined_per_scan": 8381,
        "rows_produced_per_join": 2793,
        "filtered": "33.33",
        "cost_info": {
          "read_cost": "102.00",
          "eval_cost": "279.34",
          "prefix_cost": "940.10",
          "data_read_per_join": "130K"
        },
        "used_columns": [
          "id",
          "created_at",
          "student_id",
          "class_id"
        ],
        "attached_condition": "(`test-table`.`studenthistory`.`created_at` >= TIMESTAMP'2021-11-05 18:25:50.54485')"
      }
    }
  }
} |

i.e. only class_id is used from the index (since created_at is no longer in it). rows_produced_per_join is lower (2793) due to the filter.

Without datetime filter:

FLUSH STATUS;
mysql> explain FORMAT=JSON SELECT MAX(`id`) `0` FROM `studenthistory`
    WHERE `class_id`=1  GROUP BY `student_id`;


| {
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "854.75"
    },
    "grouping_operation": {
      "using_filesort": false,
      "table": {
        "table_name": "studenthistory",
        "access_type": "ref",
        "possible_keys": [
          "fk_studenthistory_student_e25b0310",
          "idx_studenthistory_class_id_931474"
        ],
        "key": "idx_studenthistory_class_id_931474",
        "used_key_parts": [
          "class_id"
        ],
        "key_length": "4",
        "ref": [
          "const"
        ],
        "rows_examined_per_scan": 8381,
        "rows_produced_per_join": 8381,
        "filtered": "100.00",
        "using_index": true,
        "cost_info": {
          "read_cost": "16.65",
          "eval_cost": "838.10",
          "prefix_cost": "854.75",
          "data_read_per_join": "392K"
        },
        "used_columns": [
          "id",
          "student_id",
          "class_id"
        ]
      }
    }
  }
} |

Runs entirely on the 3-column index ("class_id", "student_id", "id"): same 8381 rows, slightly lower query cost (940 -> 854).

Applying the first query with original index (“class_id”, “student_id”, “created_at”) yields:

FLUSH STATUS;
    explain FORMAT=JSON SELECT MAX(`id`) `0` FROM `studenthistory`
    WHERE `class_id`=1 AND `created_at`>='2021-11-05T18:25:50.544850+00:00'
    GROUP BY `student_id`;

| {
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "858.94"
    },
    "grouping_operation": {
      "using_filesort": false,
      "table": {
        "table_name": "studenthistory",
        "access_type": "ref",
        "possible_keys": [
          "fk_studenthistory_student_e25b0310",
          "idx_studenthistory_class_id_931474"
        ],
        "key": "idx_studenthistory_class_id_931474",
        "used_key_parts": [
          "class_id"
        ],
        "key_length": "4",
        "ref": [
          "const"
        ],
        "rows_examined_per_scan": 8381,
        "rows_produced_per_join": 2793,
        "filtered": "33.33",
        "using_index": true,
        "cost_info": {
          "read_cost": "20.84",
          "eval_cost": "279.34",
          "prefix_cost": "858.94",
          "data_read_per_join": "130K"
        },
        "used_columns": [
          "id",
          "created_at",
          "student_id",
          "class_id"
        ],
        "attached_condition": "(`test-table`.`studenthistory`.`created_at` >= TIMESTAMP'2021-11-05 18:25:50.54485')"
      }
    }
  }
} |

The cost this time is 858, with "rows_examined_per_scan": 8381 and "rows_produced_per_join": 2793. Only class_id was used as the key, however (why?), and not the remaining student_id and created_at.


Answer

Query 1

select MAX(id) from studenthistory 
where class_id = 1 
   and date(created_at) = '2021-11-05' 
   and time(created_at) > TIME('04:00:00') 
group by student_id

Don’t split up the date; change it to

AND created_at > '2021-11-05 04:00:00'

If you want to check rows that were ‘created’ on that day, use something like

AND created_at >= '2021-11-05'
AND created_at  < '2021-11-05' + INTERVAL 1 DAY

Or, if you want to check for “today”:

AND created_at >= CURDATE()

After 4am this morning:

AND created_at >= CURDATE() + INTERVAL 4 HOUR

Using date(created_at) makes the created_at part of the INDEX unusable (cf. “sargable”).
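To see the sargability difference concretely, here is a minimal sketch using SQLite instead of MySQL (the plan format differs, but the principle is the same); the table and data are made up to mirror the question:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE studenthistory (
    id INTEGER PRIMARY KEY, class_id INT, student_id INT, created_at TEXT)""")
con.execute("""CREATE INDEX idx_class_created
               ON studenthistory (class_id, created_at)""")

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail); join the details.
    return " | ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

# Wrapping the column in date() hides it from the index: only class_id is usable.
print(plan("SELECT MAX(id) FROM studenthistory "
           "WHERE class_id = 1 AND date(created_at) = '2021-11-05'"))

# A plain range on the raw column lets the optimizer use both key parts.
print(plan("SELECT MAX(id) FROM studenthistory "
           "WHERE class_id = 1 AND created_at >= '2021-11-05 04:00:00'"))
```

The first plan searches the index on class_id alone and must evaluate date(created_at) per row; the second shows the range condition applied inside the index lookup.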

select MAX(id) ... group by student_id

Is likely to return multiple rows — one per student. Perhaps you want to get rid of the group by? Or specify a particular student_id?

Query 2 may run faster:

select MAX(id) from studenthistory 
where class_id = 1 
group by student_id

But the optimal index is INDEX(class_id, student_id, id). (It is OK to have both composite indexes.)

It may return multiple rows, so perhaps you want

select student_id, MAX(id) from studenthistory 
where class_id = 1 
group by student_id
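A tiny illustration of why the GROUP BY produces one row per student (SQLite here as a stand-in for MySQL; the data is invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE studenthistory "
            "(id INTEGER PRIMARY KEY, class_id INT, student_id INT)")
con.executemany("INSERT INTO studenthistory VALUES (?, ?, ?)",
                [(1, 1, 10), (2, 1, 10), (3, 1, 20), (4, 2, 10), (5, 1, 20)])

# Groupwise maximum: the latest (highest) id per student within class 1.
rows = con.execute("""SELECT student_id, MAX(id)
                      FROM studenthistory
                      WHERE class_id = 1
                      GROUP BY student_id""").fetchall()
print(sorted(rows))  # → [(10, 2), (20, 5)]
```

Without student_id in the SELECT list, you would get the same number of rows but no way to tell which MAX(id) belongs to which student.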

MAX

I believe MAX would simply fetch the last value without checking the whole row, if it is indexed.

Sometimes.

Your second query can do that. But the first query cannot — because of the range test (on created_at) being in the way.

EXPLAIN

query plan seems … similar

Alas, EXPLAIN leaves details out. You can get some more details with EXPLAIN FORMAT=JSON SELECT ..., but not necessarily enough details.

I think you will find that the second query will give a much smaller value for “Rows” after adding my suggested index.

A way to get an accurate measure of “rows (table or index) touched”:

FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';

Sensor data

For sensor data, consider multiple tables:

  • The raw data (“Fact” table, in Data Warehouse terminology). This has one row per reading per sensor.
  • The latest value for each sensor. This has one row for each of the 90K sensors. It will be a lot easier to maintain this table than to “find the latest” value for each sensor; that’s a “groupwise-maximum” problem.
  • Summary data. An example is to have high/low/average/etc values for each sensor. This has one row per hour (or day or whatever is useful) per sensor.
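The “latest value” table from the second bullet can be maintained with an upsert as each reading arrives, rather than recomputing a groupwise maximum over the Fact table. A sketch in SQLite (MySQL’s equivalent is INSERT ... ON DUPLICATE KEY UPDATE); the table and column names are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE latest_reading (
    sensor_id INTEGER PRIMARY KEY,  -- one row per sensor
    value     REAL,
    read_at   TEXT)""")

def record(sensor_id, value, read_at):
    # Upsert: insert a new sensor, or overwrite its previous latest reading.
    con.execute("""INSERT INTO latest_reading (sensor_id, value, read_at)
                   VALUES (?, ?, ?)
                   ON CONFLICT(sensor_id) DO UPDATE
                   SET value = excluded.value, read_at = excluded.read_at""",
                (sensor_id, value, read_at))

record(7, 21.5, "2021-11-05 04:00:00")
record(7, 22.1, "2021-11-05 04:05:00")  # replaces the earlier reading
print(con.execute("SELECT value FROM latest_reading "
                  "WHERE sensor_id = 7").fetchone())  # → (22.1,)
```

Reading the latest value is then a single-row primary-key lookup, no matter how large the Fact table grows.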