Skip to content
Advertisement

Selecting distinct values from a join of two large tables

I have an animals table with about 3 million records. The table has, among a few other columns, an id, name, and owner_id column. I have an animal_breeds table with about 2.5 million records. The table only has an animal_id and breed column.

I’m trying to find the distinct breed values that are associated with a specific owner_id, but the query is taking 20 seconds or so. Here’s the query:

SELECT DISTINCT `breed`
FROM `animal_breeds` 
INNER JOIN `animals` ON `animals`.`id` = `animal_breeds`.`animal_id` 
WHERE `animals`.`owner_id` = ? ;

The tables have all appropriate indices. I can’t denormalize the table by adding a breed column to the animals table because it is possible for animals to be assigned multiple breeds. I also have this problem with a few other large tables that have one-to-many relationships.

Is there a more performant way to achieve what I’m looking for? It seems like a pretty simple problem but I can’t seem to figure out the best way to achieve this other than pre-calculating and caching the results.

Here is the explain output from my query. Notice the Using temporary

id  select_type table   partitions  type    possible_keys   key key_len ref rows    filtered    Extra
1   "SIMPLE"    "a" NULL    "ref"   "PRIMARY,animals_animal_id_index"   "animals_animal_id_index"   "153"   "const" 1126303 100.00  "Using index; Using temporary"
1   "SIMPLE"    "ab"    NULL    "ref"   "animal_breeds_animal_id_breed_unique,animal_breeds_animal_id_index,animal_breeds_breed_index"  "animal_breeds_animal_id_breed_unique"  "5" "pedigreeonline.a.id"   1   100.00  "Using index"

And as requested, here are the create table statements (I left off a few unrelated columns and indices from the animals table). I believe the animal_breeds_animal_id_index index on animal_breeds table is redundant because of the unique key on the table, but we can ignore that for now as long as it’s not causing the problem 🙂

CREATE TABLE `animals` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(150) COLLATE utf8_unicode_ci NOT NULL,
  `owner_id` varchar(50) COLLATE utf8_unicode_ci DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `animals_animal_id_index` (`owner_id`,`id`),
  KEY `animals_name_index` (`name`),
) ENGINE=InnoDB AUTO_INCREMENT=2470843 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci


CREATE TABLE `animal_breeds` (
  `animal_id` int(10) unsigned DEFAULT NULL,
  `breed` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  UNIQUE KEY `animal_breeds_animal_id_breed_unique` (`animal_id`,`breed`),
  KEY `animal_breeds_animal_id_index` (`animal_id`),
  KEY `animal_breeds_breed_index` (`breed`),
  CONSTRAINT `animal_breeds_animal_id_foreign` FOREIGN KEY (`animal_id`) REFERENCES `animals` (`id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

Any help would be appreciated. Thanks!

Advertisement

Answer

With knowledge about your data you can try something like this:

SELECT
    b.*
FROM
    (
        SELECT
            DISTINCT `breed`
        FROM
            `animal_breeds`
    ) AS b
WHERE
    EXISTS (
        SELECT
            *
        FROM
            animal_breeds AS ab
            INNER JOIN animals AS a ON ab.animal_id = a.id
        WHERE
            b.breed = ab.breed
            AND a.owner_id = ?
    )
;

The idea is to get short list of distinct breeds without any filtering (for small list it would be quite fast) and then filter further the list with correlated subquery. As the list is short it would be only few subqueries executed and they will only check for existence that is much faster that any grouping (distinct == grouping).

This will only work if your distinct list is quite short.

With random generated data based on your answers the above query gave me the following execution plan:

id  select_type table   partitions  type    possible_keys   key key_len ref rows    filtered    Extra
1   PRIMARY <derived2>      ALL                 2   100.00  
3   SUBQUERY    a       ref PRIMARY,animals_animal_id_index animals_animal_id_index 153 const   1011    100.00  Using index
3   SUBQUERY    ab      ref animal_breeds_animal_id_breed_unique,`animal_breeds_animal_id_index`,animal_breeds_animal_id_index  `animal_breeds_animal_id_index` 5   test.a.id   2   100.00  Using index
2   DERIVED animal_breeds       range   animal_breeds_animal_id_breed_unique,`animal_breeds_breed_index`,animal_breeds_breed_index  `animal_breeds_breed_index` 1022        2   100.00  Using index for group-by

Alternatively, you can try to create WHERE clause like this:

...
WHERE
    b.breed IN (
        SELECT
            ab.breed
        FROM
            animal_breeds AS ab
            INNER JOIN animals AS a ON ab.animal_id = a.id
        WHERE
            a.owner_id = ?
    )
User contributions licensed under: CC BY-SA
8 People found this is helpful
Advertisement