I'm querying shopping-carts in a shop-system, like: I need to query records of c with a column which shows the number of carts which have the same user OR the same email. So I do: This works, but the problem is that the OR is very slow, because MySQL/MariaDB doesn't use any key in the subquery: Even forcing the index

Replace “OR” on 2 indexes with a faster solution (UNION?)

I’m querying shopping-carts in a shop-system, like:

DROP TABLE IF EXISTS c;
CREATE TABLE c (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `user` int(10) unsigned DEFAULT NULL,
  `email` VARCHAR(255) NOT NULL DEFAULT '', 
  `number` VARCHAR(20) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `user`(`user`),
  KEY `email`(`email`),
  UNIQUE KEY `number`(`number`)
) ENGINE=InnoDB;

INSERT INTO c SET user=1, email="test1@example.com", number="00001";
INSERT INTO c SET user=2, email="test2@example.com", number="00002";
INSERT INTO c SET user=3, email="test3@example.com", number="00003";
INSERT INTO c SET user=4, email="test1@example.com", number="00004";
INSERT INTO c SET user=1, email="test1@example.com", number="00005";

I need to query records of c with a column which shows the number of carts which have the same user OR the same email. So I do:

SELECT c.number, 
       (SELECT COUNT(DISTINCT (id)) FROM c AS c2
                  WHERE c2.email = c.email OR c2.user = c.user
       ) AS ordercount
FROM c;
   

+--------+------------+
| number | ordercount |
+--------+------------+
| 00001  |          3 |
| 00002  |          1 |
| 00003  |          1 |
| 00004  |          3 |
| 00005  |          3 |
+--------+------------+

This works, but the problem is that the OR is very slow, because MySQL/MariaDB doesn’t use any key in the subquery:

EXPLAIN SELECT c.number, 
               (SELECT COUNT(DISTINCT (id)) FROM c AS c2
                   WHERE c2.email = c.email OR c2.user = c.user
               ) AS ordercount
        FROM c;

+----+--------------------+-------+------------+------+---------------------------+--    ----+---------+------+------+----------+-------------+
| id | select_type        | table | partitions | type | possible_keys             | key  | key_len | ref  | rows | filtered | Extra       |
+----+--------------------+-------+------------+------+---------------------------+------+---------+------+------+----------+-------------+
|  1 | PRIMARY            | c     | NULL       | ALL  | NULL                      | NULL | NULL    | NULL |    5 |   100.00 | NULL        |
|  2 | DEPENDENT SUBQUERY | c2    | NULL       | ALL  | PRIMARY,number,user,email | NULL | NULL    | NULL |    5 |    36.00 | Using where |
+----+--------------------+-------+------------+------+---------------------------+------+---------+------+------+----------+-------------+

Even forcing the index doesn’t make the DB using it:

EXPLAIN SELECT c.number, 
               (SELECT COUNT(DISTINCT (id)) FROM c AS c2 FORCE INDEX(email, user)
                  WHERE c2.email = c.email OR c2.user = c.user
               ) AS ordercount
        FROM c;

+----+--------------------+-------+------------+------+---------------------------+--    ----+---------+------+------+----------+-------------+
| id | select_type        | table | partitions | type | possible_keys             | key  | key_len | ref  | rows | filtered | Extra       |
+----+--------------------+-------+------------+------+---------------------------+------+---------+------+------+----------+-------------+
|  1 | PRIMARY            | c     | NULL       | ALL  | NULL                      | NULL | NULL    | NULL |    5 |   100.00 | NULL        |
|  2 | DEPENDENT SUBQUERY | c2    | NULL       | ALL  | PRIMARY,number,user,email | NULL | NULL    | NULL |    5 |    36.00 | Using where |
+----+--------------------+-------+------------+------+---------------------------+------+---------+------+------+----------+-------------+

Using either column “email” or column “user” works fine, the key is used:

EXPLAIN SELECT c.number, 
               (SELECT COUNT(DISTINCT (id)) FROM c AS c2 WHERE c2.email = c.email) AS ordercount
        FROM c;

+----+--------------------+-------+------------+------+---------------------------+-------+---------+--------------+------+----------+-------------+
| id | select_type        | table | partitions | type | possible_keys             | key   | key_len | ref          | rows | filtered | Extra       |
+----+--------------------+-------+------------+------+---------------------------+-------+---------+--------------+------+----------+-------------+
|  1 | PRIMARY            | c     | NULL       | ALL  | NULL                      | NULL  | NULL    | NULL         |    5 |   100.00 | NULL        |
|  2 | DEPENDENT SUBQUERY | c2    | NULL       | ref  | PRIMARY,number,user,email | email | 767     | test.c.email |    3 |   100.00 | Using index |
+----+--------------------+-------+------------+------+---------------------------+-------+---------+--------------+------+----------+-------------+

The problem is that the query runs on large table with about 500.000 entries, making the query taking about 30 seconds only to query a subset of 50 records. Running the query only with the match for “email” or only with the match for “user” it just takes about 1 second for 50 records.

So I need to optimize the query. I tried to change the OR into an UNION:

SELECT c.number, 
(SELECT COUNT(DISTINCT (id)) FROM 
    ((SELECT u1.id FROM c AS u1 WHERE
     u1.email = c.email
    )
    UNION DISTINCT
    (SELECT u2.id FROM c AS u2 WHERE
    u2.user = c.user
    )) AS u2
) AS ordercount
FROM c;

but I’m getting the error: ERROR 1054 (42S22): Unknown column ‘c.email’ in ‘where clause’

Any idea how to make this query using indexes to be faster?

Answer

This is an alternative approach using two left joins:

select c.*,
       count(distinct coalesce(ce.id, cu.id))
from c left join
     c ce
     on c.email = ce.email left join
     c cu
     on c.user = cu.user and not cu.email <=> ce.email
group by c.id;

This can use separate indexes on c(user) and c(email).

Basically, this joins along the two separate dimensions, and then brings them together for the count(distinct). There are some worse case scenarios where there might be lots of matches along both dimensions. However, in many cases this might prove to work quite well because it can use the indexes rather than scanning the entire table for each row.

Advertisement

Answer