I’m trying to get a ranked list of Top 5
(i.e. most common) document titles, grouped by decade, for each of the 6 most recent decades.
The document titles are non-unique. There could be dozens or even hundreds of documents with the same title in any given calendar year.
The following query is as far as I’ve been able to go. It gives me the Top 5 titles, but only for the ‘all others’ Decade.
How can I modify the query to get the Top 5 titles for each of the other decades as well?
SELECT Top 5 documentTitle AS 'Title', RANK() OVER (PARTITION BY calendarYear ORDER BY COUNT(documentTitle) DESC) AS Rank, COUNT(tblDocumentFact.inventionTitleEnID) AS 'Number of Occurrences', CASE WHEN calendarYear BETWEEN 2010 AND 2019 THEN '2010 - 2019' WHEN calendarYear BETWEEN 2000 AND 2009 THEN '2000 - 2009' WHEN calendarYear BETWEEN 1990 AND 1999 THEN '1990 - 1999' WHEN calendarYear BETWEEN 1980 AND 1989 THEN '1980 - 1989' WHEN calendarYear BETWEEN 1970 AND 1979 THEN '1970 - 1979' WHEN calendarYear BETWEEN 1960 AND 1969 THEN '1960 - 1969' ELSE 'all others' END AS Decade FROM tbldocumentTitleDimension INNER JOIN tblDocumentFact ON tbldocumentTitleDimension.documentTitleID = tblDocumentFact.documentTitleID INNER JOIN tblDateDimension ON tblDocumentFact.publicationDateID = tblDateDimension.dateID GROUP BY documentTitle, calendarYear ORDER BY [Number of Occurrences] DESC
Advertisement
Answer
If I followed you correctly, you want to top 5 per decade. If so:
you would need to
group by
decade rather than by calendar year to get the proper counts; it is easier to compute the decade in a subquery so you don’t have to repeat thecase
expressionthe rank should be computed over
decade
partitions rather than per yearyou can then use that column to filter in an outer query
Consider:
select * from ( select dtd.documenttitle as title, rank() over (partition by dd.decade order by count(*) desc) as rnk, count(*) as number_of_occurrences, dd.decade from tbldocumentTitleDimension dtd inner join tblDocumentFact df on dtd.documenttitleid = df.documenttitleid inner join ( select dateid, case when calendarYear between 2010 and 2019 then '2010 - 2019' when calendarYear between 2000 and 2009 then '2000 - 2009' when calendarYear between 1990 and 1999 then '1990 - 1999' when calendarYear between 1980 and 1989 then '1980 - 1989' when calendarYear between 1970 and 1979 then '1970 - 1979' when calendarYear between 1960 and 1969 then '1960 - 1969' else 'all others' end AS decade from tblDateDimension ) dd on df.publicationdateid = dd.dateid group by dtd.documenttitle, dd.decade ) t where rnk <= 5 order by decade, number_of_occurrences desc
Side notes:
don’t use single quotes for identifiers (although SQL Server allows that, single quotes should be reserved for litteral stings, as defined in the SQL standard) – better yet, you can use identifiers that do not require quoting
in a multi-table query, always qualify all column names with the table they belong to; I made a few assumptions here
unless you have
null
values in columndocumentTitle
that you don’t want to count in, you can usecount(*)
instead ofcount(documentTitle)
– this is straight-forward, and more efficient