I am learning the act of query optimization, in terms of execution time. I have 3 tables which are posts, comments and users. The query below works perfect by returning the Users who have made more comments than posts in the year 2010 and their counts. I believe it can be optimized and I would appreciate an optimized code with an explanation.
Information on each Table
- Users table; Has 40325 rows, with indexing on its primary key(Id).
- Posts table; Has 91986 rows, with indexing on its primary key(PostId) and also the OwnerUserId which is the foreign key that references the users table.
- Comments table – Has 174305 rows, with indexing on its primary key, UserId(for users table) and PostId(for posts table).
SELECT pos_table.user_ID, pos_table.Username, comms, pos from (SELECT users.Id as 'user_ID', users.DisplayName as 'Username', count(posts.Id) as pos FROM users INNER JOIN posts ON posts.OwnerUserId = users.Id WHERE YEAR(posts.CreationDate) = 2010 group by users.Id ) pos_table JOIN (SELECT users.Id as 'user_ID', users.DisplayName as 'Username', count(comments.Id) as comms FROM users INNER JOIN comments ON comments.UserId = users.Id WHERE YEAR(comments.CreationDate) = 2010 group by users.Id ) comms_table on pos_table.user_ID = comms_table.user_ID HAVING comms > pos order by user_ID limit 50;
In my query above, there are two sub-queries. One for posts and the other for comments. I’d like to see how it can be optimized to have a lesser execution time.
The result of the above query, and an attachment of my EXPLAIN query:
Advertisement
Answer
One thing jumps out at me. Both your subqueries have this sort of line in them.
WHERE YEAR(posts.CreationDate) = 2010
You call a function on a column value. That’s not sargeable. It prevents MySQL from being able to exploit an index on that column, and instead requires a full scan. (MySQL, and other DBMSs, are still too dumb to know that YEAR(timestamp)
could be satisfied with an index range scan.)
So change those WHEREs to this kind of thing.
WHERE posts.CreationDate >= '2010-01-01' AND posts.CreationDate < '2010-01-01' + INTERVAL 1 YEAR
and put an index on the CreationDate columns in your posts and comments tables. Then the query planner can random-seek in the index to the first matching row, and then read it sequentially until the last matching row. This is called an index range scan, and is much more efficient than a full table scan.
EDIT You need the following indexes:
CREATE INDEX date_user ON posts ( CreationDate, OwnerUserId ); CREATE INDEX date_user ON comments ( CreationDate, UserID);
I suggest you refactor your query to make your subqueries, where all the work happens, faster.
These should be the subqueries. They each yield a number of items for each user, in the desired time range.
SELECT OwnerUserId, COUNT(*) posts FROM posts WHERE CreationDate >= '2010-01-01' AND CreationDate < '2010-01-01' + INTERVAL 1 YEAR GROUP BY OwnerUserId SELECT UserId, COUNT(*) comments FROM comments WHERE CreationDate >= '2010-01-01' AND CreationDate < '2010-01-01' + INTERVAL 1 YEAR GROUP BY UserId
These queries save time by only aggregating (summarizing by group) the minimal amount of data needed to satisfy the query. And, they can be satisfied by doing a fast index range scan on the indexes I suggested.
Then, you can use these subqueries in your main query, picking up users’ names from your users
table, like this.
SELECT users.Id user_ID, users.Username, c.comments, p.posts FROM users JOIN ( SELECT OwnerUserId, COUNT(*) posts FROM posts WHERE CreationDate >= '2010-01-01' AND CreationDate < '2010-01-01' + INTERVAL 1 YEAR GROUP BY OwnerUserId ) p ON users.ID = p.OwnerUserId JOIN ( SELECT UserId, COUNT(*) comments FROM comments WHERE CreationDate >= '2010-01-01' AND CreationDate < '2010-01-01' + INTERVAL 1 YEAR GROUP BY UserId ) c ON users.ID = c.UserId WHERE c.comments > p.posts ORDER BY users.ID LIMIT 50;
I suspect you’ll get a big performance boost if you add the compound indexes I mentioned. You can drop the indexes on CreationDate; they are redundant when you add the compound indexes.
Here’s a worthwhile reference https://use-the-index-luke.com/