I’m struggling with setting up proper, effective index for my Django application which uses MySQL database. The problem is about article table which for now has a little more than 1 million rows and querying isn’t as fast as we want.
Article table structure looks more or less like below:
Field Type id int date_published datetime(6) date_retrieved datetime(6) title varchar(500) author varchar(200) content longtext source_id int online tinyint(1) main_article_of_duplicate_group tinyint(1)
After many tries I came that below index gives best performance:
CREATE INDEX search_index ON newsarticle(date_published DESC, main_article_of_duplicate_group, source_id, online);
And the problematic query is:
SELECT `newsarticle`.`id`, `newsarticle`.`url`, `newsarticle`.`date_published`, `newsarticle`.`date_retrieved`, `newsarticle`.`title`, `newsarticle`.`summary_provided`, `newsarticle`.`summary_generated`, `newsarticle`.`source_id`, COUNT(CASE WHEN `newsarticlefeedback`.`is_relevant` THEN `newsarticlefeedback`.`id` ELSE NULL END) AS `count_relevent`, COUNT(`newsarticlefeedback`.`id`) AS `count_nonrelevent`, ( SELECT U0.`is_relevant` FROM `newsarticlefeedback` U0 WHERE (U0.`news_id_id` = `newsarticle`.`id` AND U0.`user_id_id` = 27) ORDER BY U0.`created_date` DESC LIMIT 1 ) AS `is_relevant`, CASE WHEN `newsarticle`.`content` = '' THEN 0 ELSE 1 END AS `is_content`, `newsproviders_newsprovider`.`id`, `newsproviders_newsprovider`.`name_long` FROM `newsarticle` USE INDEX (SEARCH_INDEX) INNER JOIN `newsarticle_topics` ON (`newsarticle`.`id` = `newsarticle_topics`.`newsarticle_id`) LEFT OUTER JOIN `newsarticlefeedback` ON (`newsarticle`.`id` = `newsarticlefeedback`.`news_id_id`) LEFT OUTER JOIN `newsproviders_newsprovider` ON (`newsarticle`.`source_id` = `newsproviders_newsprovider`.`id`) WHERE ((1) AND `newsarticle`.`main_article_of_duplicate_group` AND `newsarticle`.`online` AND `newsarticle_topics`.`newstopic_id` = 42 AND `newsarticle`.`date_published` >= '2020-08-08 08:39:03.199488') GROUP BY `newsarticle`.`id` ORDER BY `newsarticle`.`date_published` DESC LIMIT 30
NOTE: that I have to use the index explicitly, otherwise query is muuch slower. This query takes about 1.4s.
But when I only remove GROUP BY statement the query takes acceptable 1-10ms. I was trying to add newsarticle ID to index at different positions but without a luck.
This is output from EXPLAIN (from Django):
ID SELECT_TYPE TABLE PARTITIONS TYPE POSSIBLE_KEYS KEY KEY_LEN REF ROWS FILTERED EXTRA 1 PRIMARY newsarticle_topics None ref newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,newsartic_newstopic_id_ddd996b6_fk_summarize newsartic_newstopic_id_ddd996b6_fk_summarize 4 const 312628 100.0 Using temporary; Using filesort 1 PRIMARY newsarticle None eq_ref PRIMARY,newsartic_source_id_6ea2b978_fk_summarize,newsartic_topic_id_b67ae2c9_fk_summarize,kek,last_updated,last_update,search_index,fulltext_idx_content PRIMARY 4 newstech.newsarticle_topics.newsarticle_id 1 22.69 Using where 1 PRIMARY newsarticlefeedback None ref newsartic_news_id_id_5af7594b_fk_summarize newsartic_news_id_id_5af7594b_fk_summarize 5 newstech.newsarticle_topics.newsarticle_id 1 100.0 None 1 PRIMARY newsproviders_newsprovider None eq_ref PRIMARY, PRIMARY 4 newstech.newsarticle.source_id 1 100.0 None 2 DEPENDENT SUBQUERY U0 None ref newsartic_news_id_id_5af7594b_fk_summarize,newsartic_user_id_id_fc217cfe_fk_auth_user newsartic_user_id_id_fc217cfe_fk_auth_user 5 const 1 10.0 Using where; Using filesort
Interesting that same query gives different EXPLAIN in MySQL Workbench and in Django debug toolbar(if you want I can paste EXPLAIN from workbench as well). But the performance is more or less the same. Do you maybe have an idea how to enhance index so it can search quickly?
Thanks
EDIT: I paste here EXPLAIN from MySQL Workbench which is different but seems to be more real (not sure why Django debug toolbar explain differently)
id select_type table partitions type possible_keys key key_len ref rows filtered Extra 1 PRIMARY newsarticle NULL range PRIMARY,newsartic_source_id_6ea2b978_fk_,newsartic_topic_id_b67ae2c9_fk,kek,last_updated,last_update,search_index,fulltext_idx_content search_index 8 NULL 227426 81.00 Using index condition; Using MRR; Using temporary; Using filesort 1 PRIMARY newsarticle_topics NULL eq_ref newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,newsartic_newstopic_id_ddd996b6_fk newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq 8 newstech.newsarticle.id,const 1 100.00 Using index 1 PRIMARY newsarticlefeedback NULL ref newsartic_news_id_id_5af7594b_fk newsartic_news_id_id_5af7594b_fk 5 newstech.newsarticle.id 1 100.00 NULL 1 PRIMARY newsproviders_newsprovider NULL eq_ref PRIMARY PRIMARY 4 newstech.newsarticle.source_id 1 100.00 NULL 2 DEPENDENT SUBQUERY U0 NULL ref newsartic_news_id_id_5af7594b_fk,newsartic_user_id_id_fc217cfe_fk_auth_user newsartic_user_id_id_fc217cfe_fk_auth_user 5 const 1 10.00 Using where; Using filesort
EDIT2: Below is EXPLAIN when I remove GROUP BY from the query (used MySQL Workbench):
id,select_type,table,partitions,type,possible_keys,key,key_len,ref,rows,filtered,Extra 1,SIMPLE,newsarticle,NULL,range,search_index,search_index,8,NULL,227426,81.00,"Using index condition" 1,SIMPLE,newsarticle_topics,NULL,eq_ref,"newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,newsartic_newstopic_id_ddd996b6_fk",newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,8,"newstech.newsarticle.id,const",1,100.00,"Using index" 1,SIMPLE,newsarticlefeedback,NULL,ref,newsartic_news_id_id_5af7594b_fk,newsartic_news_id_id_5af7594b_fk,5,newstech.newsarticle.id,1,100.00,"Using index" 1,SIMPLE,newsproviders_newsprovider,NULL,eq_ref,"PRIMARY,",PRIMARY,4,newstech.newsarticle.source_id,1,100.00,NULL
EDIT3:
After applying changes suggested by Rick (Thanks!):
newsarticle(id, online, main_article_of_duplicate_group, date_published) two index for newsarticle_topics (newstopic_id, newsarticle_id) and (newsarticle_id, newstopic_id)
WITH USE_INDEX (takes 1.2s)
EXPLAIN:
id select_type table partitions type possible_keys key key_len ref rows filtered Extra 1 PRIMARY newsarticle_topics NULL ref newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,opposite opposite 4 const 346286 100.00 Using index; Using temporary; Using filesort 1 PRIMARY newsarticle NULL ref search_index search_index 4 newstech.newsarticle_topics.newsarticle_id 1 27.00 Using index condition 1 PRIMARY newsproviders_newsprovider NULL eq_ref PRIMARY,filter_index PRIMARY 4 newstech.newsarticle.source_id 1 100.00 NULL 4 DEPENDENT SUBQUERY U0 NULL ref newsartic_news_id_id_5af7594b_fk,feedback_index feedback_index 5 newstech.newsarticle.id 1 100.00 Using filesort 3 DEPENDENT SUBQUERY U0 NULL ref newsartic_news_id_id_5af7594b_fk,feedback_index newsartic_news_id_id_5af7594b_fk 5 newstech.newsarticle.id 1 10.00 Using where 2 DEPENDENT SUBQUERY U0 NULL ref newsartic_news_id_id_5af7594b_fk,feedback_index newsartic_news_id_id_5af7594b_fk 5 newstech.newsarticle.id 1 90.00 Using where
WITHOUT USE_INDEX clause (takes 2.6s)
id select_type table partitions type possible_keys key key_len ref rows filtered Extra 1 PRIMARY newsarticle_topics NULL ref newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,opposite opposite 4 const 346286 100.00 Using index; Using temporary; Using filesort 1 PRIMARY newsarticle NULL eq_ref PRIMARY,search_index PRIMARY 4 newstech.newsarticle_topics.newsarticle_id 1 27.00 Using where 1 PRIMARY newsproviders_newsprovider NULL eq_ref PRIMARY,filter_index PRIMARY 4 newstech.newsarticle.source_id 1 100.00 NULL 4 DEPENDENT SUBQUERY U0 NULL ref newsartic_news_id_id_5af7594b_fk,feedback_index feedback_index 5 newstech.newsarticle.id 1 100.00 Using filesort 3 DEPENDENT SUBQUERY U0 NULL ref newsartic_news_id_id_5af7594b_fk,feedback_index newsartic_news_id_id_5af7594b_fk 5 newstech.newsarticle.id 1 10.00 Using where 2 DEPENDENT SUBQUERY U0 NULL ref newsartic_news_id_id_5af7594b_fk,feedback_index newsartic_news_id_id_5af7594b_fk 5 newstech.newsarticle.id 1 90.00 Using where
For comparison index – newsarticle(date_published DESC, main_article_of_duplicate_group, source_id, online) with USE INDEX (takes only 1-3ms!)
id select_type table partitions type possible_keys key key_len ref rows filtered Extra 1 PRIMARY newsarticle NULL range search_index search_index 8 NULL 238876 81.00 Using index condition 1 PRIMARY newsproviders_newsprovider NULL eq_ref PRIMARY,filter_index PRIMARY 4 newstech.newsarticle.source_id 1 100.00 NULL 1 PRIMARY newsarticle_topics NULL eq_ref newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,opposite newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq 8 newstech.newsarticle.id,const 1 100.00 Using index 4 DEPENDENT SUBQUERY U0 NULL ref newsartic_news_id_id_5af7594b_fk,feedback_index feedback_index 5 newstech.newsarticle.id 1 100.00 Using filesort 3 DEPENDENT SUBQUERY U0 NULL ref newsartic_news_id_id_5af7594b_fk,feedback_index feedback_index 6 newstech.newsarticle.id,const 1 100.00 Using index 2 DEPENDENT SUBQUERY U0 NULL ref newsartic_news_id_id_5af7594b_fk,feedback_index feedback_index 5 newstech.newsarticle.id 1 90.00 Using where; Using index
Advertisement
Answer
Is main_article_of_duplicate_group
a true/false flag?
If the Optimizer chooses to start with newsarticle_topics
:
newsarticle_topics: INDEX(newstopic_id, newsarticle_id) newsarticle: INDEX(newsarticle_id, online, main_article_of_duplicate_group, date_published)
If newsarticle_topics
is a many-to-many mapping table, get rid of id
and make the PRIMARY KEY
be that pair, plus a secondary index in the opposite direction. More discussion: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table
If the Optimizer chooses to start with newsarticle
(which seems more likely):
newsarticle_topics: INDEX(newsarticle_id, newstopic_id) newsarticle: INDEX(online, main_article_of_duplicate_group, date_published)
Meanwhile, newsarticlefeedback
needs this, in the order given:
INDEX(news_id_id, user_id_id, created_date, isrelevant)
Instead of
COUNT(`newsarticlefeedback`.`id`) AS `count_nonrelevent`, LEFT OUTER JOIN `newsarticlefeedback` ON (`newsarticle`.`id` = `newsarticlefeedback`.`news_id_id`)
have
( SELECT COUNT(*) FROM newsarticlefeedback WHERE `newsarticle`.`id` = `newsarticlefeedback`.`news_id_id` ) AS `count_nonrelevent`,