SQL query optimization with 3 joins

Question

I am learning the act of query optimization, in terms of execution time. I have 3 tables which are posts, comments and users. The query below works perfect by returning the Users who have made more ...

Accepted Answer

One thing jumps out at me. Both your subqueries have this sort of line in them. WHERE YEAR(posts.CreationDate) = 2010You call a function on a column value. That’s not sargeable. It prevents MySQL from being able to exploit an index on that column, and instead requires a full scan. (MySQL, and other DBMSs, are still too dumb to know that YEAR(timestamp) could be satisfied with an index range scan.)So change those WHEREs to this kind of thing. WHERE posts.CreationDate >= '2010-01-01' AND posts.CreationDate < '2010-01-01' + INTERVAL 1 YEARand put an index on the CreationDate columns in your posts and comments tables. Then the query planner can random-seek in the index to the first matching row, and then read it sequentially until the last matching row. This is called an index range scan, and is much more efficient than a full table scan.EDIT You need the following indexes:CREATE INDEX date_user ON posts ( CreationDate, OwnerUserId );CREATE INDEX date_user ON comments ( CreationDate, UserID);I suggest you refactor your query to make your subqueries, where all the work happens, faster.These should be the subqueries. They each yield a number of items for each user, in the desired time range. SELECT OwnerUserId, COUNT(*) posts FROM posts WHERE CreationDate >= '2010-01-01' AND CreationDate < '2010-01-01' + INTERVAL 1 YEAR GROUP BY OwnerUserId SELECT UserId, COUNT(*) comments FROM comments WHERE CreationDate >= '2010-01-01' AND CreationDate < '2010-01-01' + INTERVAL 1 YEAR GROUP BY UserIdThese queries save time by only aggregating (summarizing by group) the minimal amount of data needed to satisfy the query. And, they can be satisfied by doing a fast index range scan on the indexes I suggested.Then, you can use these subqueries in your main query, picking up users’ names from your users table, like this.SELECT users.Id user_ID, users.Username, c.comments, p.posts FROM users JOIN ( SELECT OwnerUserId, COUNT(*) posts FROM posts WHERE CreationDate >= '2010-01-01' AND CreationDate < '2010-01-01' + INTERVAL 1 YEAR GROUP BY OwnerUserId ) p ON users.ID = p.OwnerUserId JOIN ( SELECT UserId, COUNT(*) comments FROM comments WHERE CreationDate >= '2010-01-01' AND CreationDate < '2010-01-01' + INTERVAL 1 YEAR GROUP BY UserId ) c ON users.ID = c.UserId WHERE c.comments > p.posts ORDER BY users.ID LIMIT 50;I suspect you’ll get a big performance boost if you add the compound indexes I mentioned. You can drop the indexes on CreationDate; they are redundant when you add the compound indexes.Here’s a worthwhile reference https://use-the-index-luke.com/

Advertisement

Answer