I’m trying to get a ranked list of Top 5
(i.e. most common) document titles, grouped by decade, for each of the 6 most recent decades.
The document titles are non-unique. There could be dozens or even hundreds of documents with the same title in any given calendar year.
The following query is as far as I’ve been able to go. It gives me the Top 5 titles, but only for the ‘all others’ Decade.
How can I modify the query to get the Top 5 titles for each of the other decades as well?
SELECT
Top 5 documentTitle AS 'Title',
RANK() OVER (PARTITION BY calendarYear ORDER BY COUNT(documentTitle) DESC) AS Rank,
COUNT(tblDocumentFact.inventionTitleEnID) AS 'Number of Occurrences',
CASE
WHEN calendarYear BETWEEN 2010 AND 2019 THEN '2010 - 2019'
WHEN calendarYear BETWEEN 2000 AND 2009 THEN '2000 - 2009'
WHEN calendarYear BETWEEN 1990 AND 1999 THEN '1990 - 1999'
WHEN calendarYear BETWEEN 1980 AND 1989 THEN '1980 - 1989'
WHEN calendarYear BETWEEN 1970 AND 1979 THEN '1970 - 1979'
WHEN calendarYear BETWEEN 1960 AND 1969 THEN '1960 - 1969'
ELSE 'all others'
END AS Decade
FROM tbldocumentTitleDimension
INNER JOIN tblDocumentFact ON tbldocumentTitleDimension.documentTitleID = tblDocumentFact.documentTitleID
INNER JOIN tblDateDimension ON tblDocumentFact.publicationDateID = tblDateDimension.dateID
GROUP BY documentTitle,
calendarYear
ORDER BY [Number of Occurrences] DESC
Advertisement
Answer
If I followed you correctly, you want to top 5 per decade. If so:
you would need to
group by
decade rather than by calendar year to get the proper counts; it is easier to compute the decade in a subquery so you don’t have to repeat thecase
expressionthe rank should be computed over
decade
partitions rather than per yearyou can then use that column to filter in an outer query
Consider:
select *
from (
select
dtd.documenttitle as title,
rank() over (partition by dd.decade order by count(*) desc) as rnk,
count(*) as number_of_occurrences,
dd.decade
from tbldocumentTitleDimension dtd
inner join tblDocumentFact df on dtd.documenttitleid = df.documenttitleid
inner join (
select
dateid,
case
when calendarYear between 2010 and 2019 then '2010 - 2019'
when calendarYear between 2000 and 2009 then '2000 - 2009'
when calendarYear between 1990 and 1999 then '1990 - 1999'
when calendarYear between 1980 and 1989 then '1980 - 1989'
when calendarYear between 1970 and 1979 then '1970 - 1979'
when calendarYear between 1960 and 1969 then '1960 - 1969'
else 'all others'
end AS decade
from tblDateDimension
) dd on df.publicationdateid = dd.dateid
group by dtd.documenttitle, dd.decade
) t
where rnk <= 5
order by decade, number_of_occurrences desc
Side notes:
don’t use single quotes for identifiers (although SQL Server allows that, single quotes should be reserved for litteral stings, as defined in the SQL standard) – better yet, you can use identifiers that do not require quoting
in a multi-table query, always qualify all column names with the table they belong to; I made a few assumptions here
unless you have
null
values in columndocumentTitle
that you don’t want to count in, you can usecount(*)
instead ofcount(documentTitle)
– this is straight-forward, and more efficient