Skip to content
Advertisement

MySQL GROUP BY slows down query x1000 times

I’m struggling with setting up proper, effective index for my Django application which uses MySQL database. The problem is about article table which for now has a little more than 1 million rows and querying isn’t as fast as we want.

Article table structure looks more or less like below:

Field   Type
id  int
date_published  datetime(6) 
date_retrieved  datetime(6) 
title   varchar(500)    
author  varchar(200)    
content longtext    
source_id   int
online  tinyint(1)
main_article_of_duplicate_group tinyint(1)

After many tries I came that below index gives best performance:

CREATE INDEX search_index ON newsarticle(date_published DESC, main_article_of_duplicate_group, source_id, online);

And the problematic query is:

SELECT 
    `newsarticle`.`id`,
    `newsarticle`.`url`,
    `newsarticle`.`date_published`,
    `newsarticle`.`date_retrieved`,
    `newsarticle`.`title`,
    `newsarticle`.`summary_provided`,
    `newsarticle`.`summary_generated`,
    `newsarticle`.`source_id`,
    COUNT(CASE WHEN `newsarticlefeedback`.`is_relevant` THEN `newsarticlefeedback`.`id` ELSE NULL END) AS `count_relevent`,
    COUNT(`newsarticlefeedback`.`id`) AS `count_nonrelevent`,
    (
      SELECT U0.`is_relevant`
      FROM `newsarticlefeedback` U0
      WHERE (U0.`news_id_id` = `newsarticle`.`id` AND U0.`user_id_id` = 27)
      ORDER BY U0.`created_date` DESC
      LIMIT 1
    ) AS `is_relevant`,
    CASE
        WHEN `newsarticle`.`content` = '' THEN 0
        ELSE 1
    END AS `is_content`,
    `newsproviders_newsprovider`.`id`,
    `newsproviders_newsprovider`.`name_long`
FROM
    `newsarticle` USE INDEX (SEARCH_INDEX)
        INNER JOIN
    `newsarticle_topics` ON (`newsarticle`.`id` = `newsarticle_topics`.`newsarticle_id`)
        LEFT OUTER JOIN
    `newsarticlefeedback` ON (`newsarticle`.`id` = `newsarticlefeedback`.`news_id_id`)
        LEFT OUTER JOIN
    `newsproviders_newsprovider` ON (`newsarticle`.`source_id` = `newsproviders_newsprovider`.`id`)
WHERE
    ((1)
        AND `newsarticle`.`main_article_of_duplicate_group`
        AND `newsarticle`.`online`
        AND `newsarticle_topics`.`newstopic_id` = 42
        AND `newsarticle`.`date_published` >= '2020-08-08 08:39:03.199488')
GROUP BY `newsarticle`.`id`
ORDER BY `newsarticle`.`date_published` DESC
LIMIT 30

NOTE: that I have to use the index explicitly, otherwise query is muuch slower. This query takes about 1.4s.

But when I only remove GROUP BY statement the query takes acceptable 1-10ms. I was trying to add newsarticle ID to index at different positions but without a luck.

This is output from EXPLAIN (from Django):

ID  SELECT_TYPE TABLE   PARTITIONS  TYPE    POSSIBLE_KEYS   KEY KEY_LEN REF ROWS    FILTERED    EXTRA
1   PRIMARY newsarticle_topics  None    ref newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,newsartic_newstopic_id_ddd996b6_fk_summarize   newsartic_newstopic_id_ddd996b6_fk_summarize    4   const   312628  100.0   Using temporary; Using filesort
1   PRIMARY newsarticle None    eq_ref  PRIMARY,newsartic_source_id_6ea2b978_fk_summarize,newsartic_topic_id_b67ae2c9_fk_summarize,kek,last_updated,last_update,search_index,fulltext_idx_content   PRIMARY 4   newstech.newsarticle_topics.newsarticle_id  1   22.69   Using where
1   PRIMARY newsarticlefeedback None    ref newsartic_news_id_id_5af7594b_fk_summarize  newsartic_news_id_id_5af7594b_fk_summarize  5   newstech.newsarticle_topics.newsarticle_id  1   100.0   None
1   PRIMARY newsproviders_newsprovider  None    eq_ref  PRIMARY,    PRIMARY 4   newstech.newsarticle.source_id  1   100.0   None
2   DEPENDENT SUBQUERY  U0  None    ref newsartic_news_id_id_5af7594b_fk_summarize,newsartic_user_id_id_fc217cfe_fk_auth_user   newsartic_user_id_id_fc217cfe_fk_auth_user  5   const   1   10.0    Using where; Using filesort

Interesting that same query gives different EXPLAIN in MySQL Workbench and in Django debug toolbar(if you want I can paste EXPLAIN from workbench as well). But the performance is more or less the same. Do you maybe have an idea how to enhance index so it can search quickly?

Thanks

EDIT: I paste here EXPLAIN from MySQL Workbench which is different but seems to be more real (not sure why Django debug toolbar explain differently)

id  select_type table   partitions  type    possible_keys   key key_len ref rows    filtered    Extra
1   PRIMARY newsarticle NULL    range   PRIMARY,newsartic_source_id_6ea2b978_fk_,newsartic_topic_id_b67ae2c9_fk,kek,last_updated,last_update,search_index,fulltext_idx_content  search_index    8   NULL    227426  81.00   Using index condition; Using MRR; Using temporary; Using filesort
1   PRIMARY newsarticle_topics  NULL    eq_ref  newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,newsartic_newstopic_id_ddd996b6_fk newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq    8   newstech.newsarticle.id,const   1   100.00  Using index
1   PRIMARY newsarticlefeedback NULL    ref newsartic_news_id_id_5af7594b_fk    newsartic_news_id_id_5af7594b_fk    5   newstech.newsarticle.id 1   100.00  NULL
1   PRIMARY newsproviders_newsprovider  NULL    eq_ref  PRIMARY PRIMARY 4   newstech.newsarticle.source_id  1   100.00  NULL
2   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,newsartic_user_id_id_fc217cfe_fk_auth_user newsartic_user_id_id_fc217cfe_fk_auth_user  5   const   1   10.00   Using where; Using filesort

EDIT2: Below is EXPLAIN when I remove GROUP BY from the query (used MySQL Workbench):

id,select_type,table,partitions,type,possible_keys,key,key_len,ref,rows,filtered,Extra
1,SIMPLE,newsarticle,NULL,range,search_index,search_index,8,NULL,227426,81.00,"Using index condition"
1,SIMPLE,newsarticle_topics,NULL,eq_ref,"newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,newsartic_newstopic_id_ddd996b6_fk",newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,8,"newstech.newsarticle.id,const",1,100.00,"Using index"
1,SIMPLE,newsarticlefeedback,NULL,ref,newsartic_news_id_id_5af7594b_fk,newsartic_news_id_id_5af7594b_fk,5,newstech.newsarticle.id,1,100.00,"Using index"
1,SIMPLE,newsproviders_newsprovider,NULL,eq_ref,"PRIMARY,",PRIMARY,4,newstech.newsarticle.source_id,1,100.00,NULL

EDIT3:

After applying changes suggested by Rick (Thanks!):

newsarticle(id, online, main_article_of_duplicate_group, date_published) two index for newsarticle_topics (newstopic_id, newsarticle_id) and (newsarticle_id, newstopic_id)

WITH USE_INDEX (takes 1.2s)

EXPLAIN:

id  select_type table   partitions  type    possible_keys   key key_len ref rows    filtered    Extra
1   PRIMARY newsarticle_topics  NULL    ref newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,opposite   opposite    4   const   346286  100.00  Using index; Using temporary; Using filesort
1   PRIMARY newsarticle NULL    ref search_index    search_index    4   newstech.newsarticle_topics.newsarticle_id  1   27.00   Using index condition
1   PRIMARY newsproviders_newsprovider  NULL    eq_ref  PRIMARY,filter_index    PRIMARY 4   newstech.newsarticle.source_id  1   100.00  NULL
4   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index feedback_index  5   newstech.newsarticle.id 1   100.00  Using filesort
3   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index newsartic_news_id_id_5af7594b_fk    5   newstech.newsarticle.id 1   10.00   Using where
2   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index newsartic_news_id_id_5af7594b_fk    5   newstech.newsarticle.id 1   90.00   Using where

WITHOUT USE_INDEX clause (takes 2.6s)

id  select_type table   partitions  type    possible_keys   key key_len ref rows    filtered    Extra
1   PRIMARY newsarticle_topics  NULL    ref newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,opposite   opposite    4   const   346286  100.00  Using index; Using temporary; Using filesort
1   PRIMARY newsarticle NULL    eq_ref  PRIMARY,search_index    PRIMARY 4   newstech.newsarticle_topics.newsarticle_id  1   27.00   Using where
1   PRIMARY newsproviders_newsprovider  NULL    eq_ref  PRIMARY,filter_index    PRIMARY 4   newstech.newsarticle.source_id  1   100.00  NULL
4   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index feedback_index  5   newstech.newsarticle.id 1   100.00  Using filesort
3   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index newsartic_news_id_id_5af7594b_fk    5   newstech.newsarticle.id 1   10.00   Using where
2   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index newsartic_news_id_id_5af7594b_fk    5   newstech.newsarticle.id 1   90.00   Using where

For comparison index – newsarticle(date_published DESC, main_article_of_duplicate_group, source_id, online) with USE INDEX (takes only 1-3ms!)

id  select_type table   partitions  type    possible_keys   key key_len ref rows    filtered    Extra
1   PRIMARY newsarticle NULL    range   search_index    search_index    8   NULL    238876  81.00   Using index condition
1   PRIMARY newsproviders_newsprovider  NULL    eq_ref  PRIMARY,filter_index    PRIMARY 4   newstech.newsarticle.source_id  1   100.00  NULL
1   PRIMARY newsarticle_topics  NULL    eq_ref  newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,opposite   newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq    8   newstech.newsarticle.id,const   1   100.00  Using index
4   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index feedback_index  5   newstech.newsarticle.id 1   100.00  Using filesort
3   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index feedback_index  6   newstech.newsarticle.id,const   1   100.00  Using index
2   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index feedback_index  5   newstech.newsarticle.id 1   90.00   Using where; Using index

Advertisement

Answer

Is main_article_of_duplicate_group a true/false flag?

If the Optimizer chooses to start with newsarticle_topics:

 newsarticle_topics:  INDEX(newstopic_id, newsarticle_id)
 newsarticle:  INDEX(newsarticle_id, online,
                     main_article_of_duplicate_group, date_published)

If newsarticle_topics is a many-to-many mapping table, get rid of id and make the PRIMARY KEY be that pair, plus a secondary index in the opposite direction. More discussion: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table

If the Optimizer chooses to start with newsarticle (which seems more likely):

 newsarticle_topics:  INDEX(newsarticle_id, newstopic_id)
 newsarticle:  INDEX(online, main_article_of_duplicate_group, date_published)

Meanwhile, newsarticlefeedback needs this, in the order given:

INDEX(news_id_id, user_id_id, created_date, isrelevant)

Instead of

    COUNT(`newsarticlefeedback`.`id`) AS `count_nonrelevent`,
    LEFT OUTER JOIN  `newsarticlefeedback`
          ON (`newsarticle`.`id` = `newsarticlefeedback`.`news_id_id`)

have

    ( SELECT COUNT(*) FROM newsarticlefeedback
          WHERE `newsarticle`.`id` = `newsarticlefeedback`.`news_id_id`
    ) AS `count_nonrelevent`,

User contributions licensed under: CC BY-SA
5 People found this is helpful
Advertisement