I’m querying shopping-carts in a shop-system, like:
DROP TABLE IF EXISTS c; CREATE TABLE c ( `id` int(10) unsigned NOT NULL AUTO_INCREMENT, `user` int(10) unsigned DEFAULT NULL, `email` VARCHAR(255) NOT NULL DEFAULT '', `number` VARCHAR(20) NOT NULL DEFAULT '', PRIMARY KEY (`id`), KEY `user`(`user`), KEY `email`(`email`), UNIQUE KEY `number`(`number`) ) ENGINE=InnoDB; INSERT INTO c SET user=1, email="test1@example.com", number="00001"; INSERT INTO c SET user=2, email="test2@example.com", number="00002"; INSERT INTO c SET user=3, email="test3@example.com", number="00003"; INSERT INTO c SET user=4, email="test1@example.com", number="00004"; INSERT INTO c SET user=1, email="test1@example.com", number="00005";
I need to query records of c with a column which shows the number of carts which have the same user OR the same email. So I do:
SELECT c.number, (SELECT COUNT(DISTINCT (id)) FROM c AS c2 WHERE c2.email = c.email OR c2.user = c.user ) AS ordercount FROM c; +--------+------------+ | number | ordercount | +--------+------------+ | 00001 | 3 | | 00002 | 1 | | 00003 | 1 | | 00004 | 3 | | 00005 | 3 | +--------+------------+
This works, but the problem is that the OR is very slow, because MySQL/MariaDB doesn’t use any key in the subquery:
EXPLAIN SELECT c.number, (SELECT COUNT(DISTINCT (id)) FROM c AS c2 WHERE c2.email = c.email OR c2.user = c.user ) AS ordercount FROM c; +----+--------------------+-------+------------+------+---------------------------+-- ----+---------+------+------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+--------------------+-------+------------+------+---------------------------+------+---------+------+------+----------+-------------+ | 1 | PRIMARY | c | NULL | ALL | NULL | NULL | NULL | NULL | 5 | 100.00 | NULL | | 2 | DEPENDENT SUBQUERY | c2 | NULL | ALL | PRIMARY,number,user,email | NULL | NULL | NULL | 5 | 36.00 | Using where | +----+--------------------+-------+------------+------+---------------------------+------+---------+------+------+----------+-------------+
Even forcing the index doesn’t make the DB using it:
EXPLAIN SELECT c.number, (SELECT COUNT(DISTINCT (id)) FROM c AS c2 FORCE INDEX(email, user) WHERE c2.email = c.email OR c2.user = c.user ) AS ordercount FROM c; +----+--------------------+-------+------------+------+---------------------------+-- ----+---------+------+------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+--------------------+-------+------------+------+---------------------------+------+---------+------+------+----------+-------------+ | 1 | PRIMARY | c | NULL | ALL | NULL | NULL | NULL | NULL | 5 | 100.00 | NULL | | 2 | DEPENDENT SUBQUERY | c2 | NULL | ALL | PRIMARY,number,user,email | NULL | NULL | NULL | 5 | 36.00 | Using where | +----+--------------------+-------+------------+------+---------------------------+------+---------+------+------+----------+-------------+
Using either column “email” or column “user” works fine, the key is used:
EXPLAIN SELECT c.number, (SELECT COUNT(DISTINCT (id)) FROM c AS c2 WHERE c2.email = c.email) AS ordercount FROM c; +----+--------------------+-------+------------+------+---------------------------+-------+---------+--------------+------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+--------------------+-------+------------+------+---------------------------+-------+---------+--------------+------+----------+-------------+ | 1 | PRIMARY | c | NULL | ALL | NULL | NULL | NULL | NULL | 5 | 100.00 | NULL | | 2 | DEPENDENT SUBQUERY | c2 | NULL | ref | PRIMARY,number,user,email | email | 767 | test.c.email | 3 | 100.00 | Using index | +----+--------------------+-------+------------+------+---------------------------+-------+---------+--------------+------+----------+-------------+
The problem is that the query runs on large table with about 500.000 entries, making the query taking about 30 seconds only to query a subset of 50 records. Running the query only with the match for “email” or only with the match for “user” it just takes about 1 second for 50 records.
So I need to optimize the query. I tried to change the OR into an UNION:
SELECT c.number, (SELECT COUNT(DISTINCT (id)) FROM ((SELECT u1.id FROM c AS u1 WHERE u1.email = c.email ) UNION DISTINCT (SELECT u2.id FROM c AS u2 WHERE u2.user = c.user )) AS u2 ) AS ordercount FROM c;
but I’m getting the error: ERROR 1054 (42S22): Unknown column ‘c.email’ in ‘where clause’
Any idea how to make this query using indexes to be faster?
Advertisement
Answer
This is an alternative approach using two left join
s:
select c.*, count(distinct coalesce(ce.id, cu.id)) from c left join c ce on c.email = ce.email left join c cu on c.user = cu.user and not cu.email <=> ce.email group by c.id;
This can use separate indexes on c(user)
and c(email)
.
Basically, this joins along the two separate dimensions, and then brings them together for the count(distinct)
. There are some worse case scenarios where there might be lots of matches along both dimensions. However, in many cases this might prove to work quite well because it can use the indexes rather than scanning the entire table for each row.