Skip to content
Advertisement

SQL WHERE IN () Performance Optimization

I checked several questions for a duplicate but I couldn’t find one. I am dealing with three tables, the first “articles”, the second “tags”, and the third “article_tags” which contains two foreign keys “articleid” and “tagid”. The table “article_tags” relates articles sharing the same tag together.

My SQL query is as follows:

'Getting 4 tag ids from table article_tags
MyCommand = New SqlCommand("SELECT TOP (4) at.tagid FROM articles a, article_tags at WHERE at.articleid=a.id AND at.articleid=@id", myconnection)
MyCommand.Parameters.Add("@id", SqlDbType.Int).Value = intID
Dim daQuery = New SqlDataAdapter(MyCommand)
Dim dtArticleTags = New DataTable
daQuery.Fill(dtArticleTags)
Dim strTagIds As String = ""
If dtArticleTags.Rows.Count > 0 Then
    'Save all tags id's in a string "strTags"
    '--------------------------------------------- 
    Dim i As Integer = 0
    For Each myrow As DataRow In dtArticleTags.Rows
        'Store the Tag Ids in a string
        strTagIds += myrow.Item("tagid").ToString
        If i <> dtArticleTags.Rows.Count - 1 Then
            strTagIds += ","
        End If
        i += 1
    Next
    '--------------------------------------------- 
    'Getting 5 related articles sharing the same tags (Note: I know that strTagIds is not parametrized but this can never be inputted by a user)
    MyCommand = New SqlCommand("SELECT TOP(5) at.articleid FROM article_tags at, articles a WHERE a.id=at.articleid AND a.publish_flag=1 AND SYSDATETIME() > DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), SYSDATETIME()), a.publish_date) AND at.tagid IN (" & strTagIds & ") AND at.articleid<>@id Group by at.articleid, a.publish_date ORDER BY a.publish_date DESC", myconnection)
    MyCommand.Parameters.Add("@id", SqlDbType.Int).Value = intID
    daRelated = New SqlDataAdapter(MyCommand)
    daRelated.Fill(dsArticle, "related")
End If

I feel that the above query is taking time to load especially with the “article_tags” table growing bigger. I am using SQL Express Edition and the tables are indexed and I would like to know if this can be done in a better way to enhance performance.

Attached is the Execution Plan:

enter image description here

Running the below query as suggested by @SuperPoney, returned the same results, and below is the execution plan:

WITH top_art AS
(
    SELECT TOP (4) at.tagid
    FROM articles a, article_tags at
    WHERE at.articleid=a.id
    AND at.articleid=@id
)
SELECT TOP(5) at.articleid
FROM article_tags at, articles a, top_art
WHERE a.id=at.articleid
AND a.publish_flag=1
AND SYSDATETIME() > DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), SYSDATETIME()), a.publish_date)
AND at.tagid=top_art.tag_id
AND at.articleid<>@id
Group by at.articleid, a.publish_date
ORDER BY a.publish_date DESC

enter image description here

Advertisement

Answer

You can get everything you need in a single query:

SELECT  TOP (5) a.ID
FROM    article AS a
WHERE   a.publish_flag = 1 
AND     a.publish_date <  DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), SYSDATETIME()), SYSDATETIME())
AND     a.Id <> @ID 
AND     EXISTS 
        (   SELECT  1
            FROM    article_tags AS at
            WHERE   at.ArticleID = a.ID
            AND     EXISTS  
                    (   SELECT  1 
                        FROM    article_tags AS at2
                        WHERE   at2.ArticleID = @ID
                        AND     at2.TagID = at.TagID
                    )
        )
ORDER BY a.publish_date DESC;
                    

I have assumed that you were originally using TOP 4 for tags as an arbitrary limit for performance reasons, as there was no sort. So have ommitted this. I have also changed your predicate from:

SYSDATETIME() > DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), SYSDATETIME()), a.publish_date)

to

a.publish_date <  DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), SYSDATETIME()), SYSDATETIME())

The meaning is the same, however by calling the DATEADD/DATEDIFF functions on the run time constants SYSDATETIME() and UTCDATETIME() it means this calculation is only done once, rather than once for every a.publish_date meaning any index on publish_date is now usable.

The other change I have made is to use EXISTS rather than JOIN to link articles to tags. This will avoid duplicates, however it would be equally trivial to remove duplicates using GROUP BY e.g.

SELECT  TOP (5) a.ID
FROM    article AS a
        INNER JOIN article_tags AS at
            ON at.ArticleID = a.ID
WHERE   a.publish_flag = 1 
AND     a.publish_date <  DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), SYSDATETIME()), SYSDATETIME())
AND     a.Id <> @ID 
AND     EXISTS  
        (   SELECT  1 
            FROM    article_tags AS at2
            WHERE   at2.ArticleID = @ID
            AND     at2.TagID = at.TagID
        )
GROUP BY a.ID, a.publish_date
ORDER BY a.publish_date DESC;

A few side notes as well that don’t directly relate to the above answer, but are still worth mentioning.

  1. The Implicit join syntax you are using was replaced 28 years ago by ANSI 92 explicit join syntax. There are plenty of good reasons to switch to the “new” syntax, so I would advise you do.
  2. Parameterised queries are about more than just SQL Injection attacks (including but not limited to type safety and query plan caching), so just because your input isn’t coming from a user doesn’t mean you shouldn’t use parametrized queries.
  3. I would strongly advise against re-using your SqlClient objects (SqlConnection, SqlCommand), create a new object for each use, and dispose of it correctly when done.
User contributions licensed under: CC BY-SA
10 People found this is helpful
Advertisement