CONTEXT
I have a large table full of “documents” that are updated by outside sources. When I notice the updates are more recent than my last touchpoint I need to address these documents. I’m having some serious performance issues though.
EXAMPLE CODE
select count(*) from documents;
gets me back 212,494,397 documents in 1 min 15.24 sec.
select count(*) from documents where COALESCE( updated_at > last_indexed_at, TRUE);
which is apx the actual query gets me 55,988,860 in 14 min 36.23 sec.
select count(*) from documents where COALESCE( updated_at > last_indexed_at, TRUE) limit 1;
notably takes about 15 minutes as well. (this was surprising to me)
THE PROBLEM
How do I perform the
updated_at > last_indexed_at
in a more reasonable time?
DETAILS
I’m pretty certain that my query is, in some way, not sargable. Unfortunately, I can’t find what about this query prevents it from being executed on a row independent basis.
select count(*)
from documents
where last_indexed_at is null or updated_at > last_indexed_at;
doesn’t do any better.
nor does
select count( distinct( id ) )
from documents
where last_indexed_at is null or updated_at > last_indexed_at limit 1;
nor does
select count( distinct( id ) )
from documents limit 1;
EDIT: FOLLOW UP REQUESTED DATA
This question only involves one table (thankfully) in a rails project, so we conveniently have the rails definition for the table.
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `documents` (
`id` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`document_id` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`document_type` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`locale` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`allowed_ids` text COLLATE utf8mb4_unicode_ci NOT NULL,
`fields` mediumtext COLLATE utf8mb4_unicode_ci,
`created_at` datetime(6) NOT NULL,
`updated_at` datetime(6) NOT NULL,
`last_indexed_at` datetime(6) DEFAULT NULL,
`deleted_at` datetime(6) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_documents_on_document_type` (`document_type`),
KEY `index_documents_on_locale` (`locale`),
KEY `index_documents_on_last_indexed_at` (`last_indexed_at`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
SELECT VERSION();
got me 5.7.27-30-log
And probably most import,
explain select count(*) from documents where COALESCE( updated_at > last_indexed_at, TRUE);
gets me exactly
+----+-------------+-----------+------------+------+---------------+------+---------+------+-----------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+------+---------------+------+---------+------+-----------+----------+-------------+
| 1 | SIMPLE | documents | NULL | ALL | NULL | NULL | NULL | NULL | 208793754 | 100.00 | Using where |
+----+-------------+-----------+------------+------+---------------+------+---------+------+-----------+----------+-------------+
Advertisement
Answer
Oh! MySQL 5.7 introduced Generated Columns â which gives us a way of indexing expressions! ð¥³
If you do something like this:
ALTER TABLE documents
ADD COLUMN dirty BOOL GENERATED ALWAYS AS (COALESCE(updated_at > last_indexed_at, TRUE)) STORED,
ADD INDEX index_documents_on_dirty(dirty);
…and change the query to:
SELECT COUNT(*) FROM documents WHERE dirty;
…what results do you get?
Hopefully, we’re moving the work of evaluating COALESCE(updated_at > last_indexed_at, TRUE)
from Read time to Write time.