CONTEXT
I have a large table full of “documents” that are updated by outside sources. When I notice the updates are more recent than my last touchpoint I need to address these documents. I’m having some serious performance issues though.
EXAMPLE CODE
select count(*) from documents;
gets me back 212,494,397 documents in 1 min 15.24 sec.
select count(*) from documents where COALESCE( updated_at > last_indexed_at, TRUE);
which is apx the actual query gets me 55,988,860 in 14 min 36.23 sec.
select count(*) from documents where COALESCE( updated_at > last_indexed_at, TRUE) limit 1;
notably takes about 15 minutes as well. (this was surprising to me)
THE PROBLEM
How do I perform the
updated_at > last_indexed_at
in a more reasonable time?
DETAILS
I’m pretty certain that my query is, in some way, not sargable. Unfortunately, I can’t find what about this query prevents it from being executed on a row independent basis.
select count(*) from documents where last_indexed_at is null or updated_at > last_indexed_at;
doesn’t do any better.
nor does
select count( distinct( id ) ) from documents where last_indexed_at is null or updated_at > last_indexed_at limit 1;
nor does
select count( distinct( id ) ) from documents limit 1;
EDIT: FOLLOW UP REQUESTED DATA
This question only involves one table (thankfully) in a rails project, so we conveniently have the rails definition for the table.
/*!40101 SET @saved_cs_client = @@character_set_client */; /*!40101 SET character_set_client = utf8 */; CREATE TABLE `documents` ( `id` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL, `document_id` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL, `document_type` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL, `locale` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL, `allowed_ids` text COLLATE utf8mb4_unicode_ci NOT NULL, `fields` mediumtext COLLATE utf8mb4_unicode_ci, `created_at` datetime(6) NOT NULL, `updated_at` datetime(6) NOT NULL, `last_indexed_at` datetime(6) DEFAULT NULL, `deleted_at` datetime(6) DEFAULT NULL, PRIMARY KEY (`id`), KEY `index_documents_on_document_type` (`document_type`), KEY `index_documents_on_locale` (`locale`), KEY `index_documents_on_last_indexed_at` (`last_indexed_at`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
SELECT VERSION();
got me 5.7.27-30-log
And probably most import,
explain select count(*) from documents where COALESCE( updated_at > last_indexed_at, TRUE);
gets me exactly
+----+-------------+-----------+------------+------+---------------+------+---------+------+-----------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-----------+------------+------+---------------+------+---------+------+-----------+----------+-------------+ | 1 | SIMPLE | documents | NULL | ALL | NULL | NULL | NULL | NULL | 208793754 | 100.00 | Using where | +----+-------------+-----------+------------+------+---------------+------+---------+------+-----------+----------+-------------+
Advertisement
Answer
Oh! MySQL 5.7 introduced Generated Columns â which gives us a way of indexing expressions! ð¥³
If you do something like this:
ALTER TABLE documents ADD COLUMN dirty BOOL GENERATED ALWAYS AS (COALESCE(updated_at > last_indexed_at, TRUE)) STORED, ADD INDEX index_documents_on_dirty(dirty);
…and change the query to:
SELECT COUNT(*) FROM documents WHERE dirty;
…what results do you get?
Hopefully, we’re moving the work of evaluating COALESCE(updated_at > last_indexed_at, TRUE)
from Read time to Write time.