I have an animals
table with about 3 million records. The table has, among a few other columns, an id
, name
, and owner_id
column. I have an animal_breeds
table with about 2.5 million records. The table only has an animal_id
and breed
column.
I’m trying to find the distinct breed
values that are associated with a specific owner_id
, but the query is taking 20 seconds or so. Here’s the query:
SELECT DISTINCT `breed` FROM `animal_breeds` INNER JOIN `animals` ON `animals`.`id` = `animal_breeds`.`animal_id` WHERE `animals`.`owner_id` = ? ;
The tables have all appropriate indices. I can’t denormalize the table by adding a breed
column to the animals
table because it is possible for animals to be assigned multiple breeds. I also have this problem with a few other large tables that have one-to-many relationships.
Is there a more performant way to achieve what I’m looking for? It seems like a pretty simple problem but I can’t seem to figure out the best way to achieve this other than pre-calculating and caching the results.
Here is the explain output from my query. Notice the Using temporary
id select_type table partitions type possible_keys key key_len ref rows filtered Extra 1 "SIMPLE" "a" NULL "ref" "PRIMARY,animals_animal_id_index" "animals_animal_id_index" "153" "const" 1126303 100.00 "Using index; Using temporary" 1 "SIMPLE" "ab" NULL "ref" "animal_breeds_animal_id_breed_unique,animal_breeds_animal_id_index,animal_breeds_breed_index" "animal_breeds_animal_id_breed_unique" "5" "pedigreeonline.a.id" 1 100.00 "Using index"
And as requested, here are the create table statements (I left off a few unrelated columns and indices from the animals
table). I believe the animal_breeds_animal_id_index
index on animal_breeds
table is redundant because of the unique key on the table, but we can ignore that for now as long as it’s not causing the problem 🙂
CREATE TABLE `animals` ( `id` int(10) unsigned NOT NULL AUTO_INCREMENT, `name` varchar(150) COLLATE utf8_unicode_ci NOT NULL, `owner_id` varchar(50) COLLATE utf8_unicode_ci DEFAULT NULL, PRIMARY KEY (`id`), KEY `animals_animal_id_index` (`owner_id`,`id`), KEY `animals_name_index` (`name`), ) ENGINE=InnoDB AUTO_INCREMENT=2470843 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci CREATE TABLE `animal_breeds` ( `animal_id` int(10) unsigned DEFAULT NULL, `breed` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL, UNIQUE KEY `animal_breeds_animal_id_breed_unique` (`animal_id`,`breed`), KEY `animal_breeds_animal_id_index` (`animal_id`), KEY `animal_breeds_breed_index` (`breed`), CONSTRAINT `animal_breeds_animal_id_foreign` FOREIGN KEY (`animal_id`) REFERENCES `animals` (`id`) ON DELETE CASCADE ON UPDATE CASCADE ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
Any help would be appreciated. Thanks!
Advertisement
Answer
With knowledge about your data you can try something like this:
SELECT b.* FROM ( SELECT DISTINCT `breed` FROM `animal_breeds` ) AS b WHERE EXISTS ( SELECT * FROM animal_breeds AS ab INNER JOIN animals AS a ON ab.animal_id = a.id WHERE b.breed = ab.breed AND a.owner_id = ? ) ;
The idea is to get short list of distinct breeds without any filtering (for small list it would be quite fast) and then filter further the list with correlated subquery. As the list is short it would be only few subqueries executed and they will only check for existence that is much faster that any grouping (distinct == grouping).
This will only work if your distinct list is quite short.
With random generated data based on your answers the above query gave me the following execution plan:
id select_type table partitions type possible_keys key key_len ref rows filtered Extra 1 PRIMARY <derived2> ALL 2 100.00 3 SUBQUERY a ref PRIMARY,animals_animal_id_index animals_animal_id_index 153 const 1011 100.00 Using index 3 SUBQUERY ab ref animal_breeds_animal_id_breed_unique,`animal_breeds_animal_id_index`,animal_breeds_animal_id_index `animal_breeds_animal_id_index` 5 test.a.id 2 100.00 Using index 2 DERIVED animal_breeds range animal_breeds_animal_id_breed_unique,`animal_breeds_breed_index`,animal_breeds_breed_index `animal_breeds_breed_index` 1022 2 100.00 Using index for group-by
Alternatively, you can try to create WHERE clause like this:
... WHERE b.breed IN ( SELECT ab.breed FROM animal_breeds AS ab INNER JOIN animals AS a ON ab.animal_id = a.id WHERE a.owner_id = ? )