Skip to content
Advertisement

Top-N By Decade for successive decades (in SQL Server)

I’m trying to get a ranked list of Top 5 (i.e. most common) document titles, grouped by decade, for each of the 6 most recent decades.

The document titles are non-unique. There could be dozens or even hundreds of documents with the same title in any given calendar year.

The following query is as far as I’ve been able to go. It gives me the Top 5 titles, but only for the ‘all others’ Decade.

How can I modify the query to get the Top 5 titles for each of the other decades as well?

SELECT
    Top 5 documentTitle AS 'Title',
    RANK() OVER (PARTITION BY calendarYear ORDER BY COUNT(documentTitle) DESC) AS Rank,
    COUNT(tblDocumentFact.inventionTitleEnID) AS 'Number of Occurrences',
    CASE
        WHEN calendarYear BETWEEN 2010 AND 2019 THEN '2010 - 2019'
        WHEN calendarYear BETWEEN 2000 AND 2009 THEN '2000 - 2009'
        WHEN calendarYear BETWEEN 1990 AND 1999 THEN '1990 - 1999'
        WHEN calendarYear BETWEEN 1980 AND 1989 THEN '1980 - 1989'
        WHEN calendarYear BETWEEN 1970 AND 1979 THEN '1970 - 1979'
        WHEN calendarYear BETWEEN 1960 AND 1969 THEN '1960 - 1969'
        ELSE 'all others'
    END AS Decade
FROM        tbldocumentTitleDimension
INNER JOIN  tblDocumentFact     ON  tbldocumentTitleDimension.documentTitleID   = tblDocumentFact.documentTitleID
INNER JOIN  tblDateDimension    ON  tblDocumentFact.publicationDateID           = tblDateDimension.dateID
GROUP BY    documentTitle,
            calendarYear
ORDER BY    [Number of Occurrences] DESC

Advertisement

Answer

If I followed you correctly, you want to top 5 per decade. If so:

  • you would need to group by decade rather than by calendar year to get the proper counts; it is easier to compute the decade in a subquery so you don’t have to repeat the case expression

  • the rank should be computed over decade partitions rather than per year

  • you can then use that column to filter in an outer query

Consider:

select *
from (
    select
        dtd.documenttitle as title,
        rank() over (partition by dd.decade order by count(*) desc) as rnk,
        count(*) as number_of_occurrences,
        dd.decade
    from tbldocumentTitleDimension dtd
    inner join tblDocumentFact df on dtd.documenttitleid   = df.documenttitleid
    inner join  (
        select 
            dateid,
            case
                when calendarYear between 2010 and 2019 then '2010 - 2019'
                when calendarYear between 2000 and 2009 then '2000 - 2009'
                when calendarYear between 1990 and 1999 then '1990 - 1999'
                when calendarYear between 1980 and 1989 then '1980 - 1989'
                when calendarYear between 1970 and 1979 then '1970 - 1979'
                when calendarYear between 1960 and 1969 then '1960 - 1969'
                else 'all others'
            end AS decade
            from tblDateDimension
    ) dd on df.publicationdateid  = dd.dateid
    group by dtd.documenttitle, dd.decade
) t
where rnk <= 5
order by decade, number_of_occurrences desc

Side notes:

  • don’t use single quotes for identifiers (although SQL Server allows that, single quotes should be reserved for litteral stings, as defined in the SQL standard) – better yet, you can use identifiers that do not require quoting

  • in a multi-table query, always qualify all column names with the table they belong to; I made a few assumptions here

  • unless you have null values in column documentTitle that you don’t want to count in, you can use count(*) instead of count(documentTitle) – this is straight-forward, and more efficient

User contributions licensed under: CC BY-SA
4 People found this is helpful
Advertisement