Skip to content
Advertisement

Aggregate data with date ranges

I have the following table (#CategoryWeight) where weight and factor values are stored for each category with date ranges. I need to aggregate / simplify this data, when possible, so that consecutive data ranges are combined into a wide range for the same values of weight and factor.

DROP TABLE IF EXISTS #CategoryWeight;
CREATE TABLE #CategoryWeight ( [CategoryId] bigint, [weight] float(8), [factor] float(8), [startYear] nvarchar(60), [endYear] nvarchar(60) )
INSERT INTO #CategoryWeight
VALUES
( 42, 1, 0, N'2009', N'2014' ), 
( 42, 1, 0, N'2009', N'2019' ), 
( 42, 1, 0, N'2015', N'2017' ), 
( 42, 1, 0, N'2018', N'2019' ), 
( 42, 1, 1, N'2020', N'9999' ),

( 40, 1, 0, N'2009', N'2014' ), 
( 40, 1, 0, N'2009', N'2017' ), 
( 40, 1, 0, N'2015', N'2017' ), 
( 40, 1, 0, N'2020', N'9999' ), 
( 40, 1, 1, N'2018', N'2019' ),

( 45, 1, 0, N'2009', N'2014' ), 
( 45, 0, 0, N'2015', N'2017' ), 
( 45, 1, 0, N'2020', N'9999' ), 
( 45, 0, 1, N'2018', N'2019' );'

CategoryID  weight  factor  startYear   endYear
42          1       0       2009        2014
42          1       0       2009        2019
42          1       0       2015        2017
42          1       0       2018        2019
42          1       1       2020        9999
40          1       0       2009        2014
40          1       0       2009        2017
40          1       0       2015        2017
40          1       0       2020        9999
40          1       1       2018        2019
45          1       0       2009        2014
45          0       0       2015        2017
45          1       0       2020        9999
45          0       1       2018        2019

Expected result:

CategoryID  weight  factor  startYear   endYear
42          1       0       2009        2019
42          1       1       2020        9999
40          1       0       2009        2017
40          1       1       2018        2019
40          1       0       2020        9999
45          1       0       2009        2014
45          0       0       2015        2017
45          0       1       2018        2019
45          1       0       2020        9999

Advertisement

Answer

You have overlapping time periods. That makes any assumption about the data troublesome — because the same year could have different values on different rows (nothing in your question precludes this).

Hence, the approach that I recommend is to expand out the data and then recombine it into chunks where the values are the same. The following uses a recursive CTE to expand the data and then a gaps-and-islands trick to recombine it:

with cte as (
      select categoryid, weight, factor,
             convert(int, startyear) as year, convert(int, endyear) as endyear
      from categoryweight
      union all
      select categoryid, weight, factor,
             year + 1, endyear
      from cte
      where year < endyear
     )
select categoryid, weight, factor, min(year), max(year)
from (select categoryid, weight, factor, year,
             row_number() over (partition by categoryid, weight, factor order by year) as seqnum
      from (select distinct categoryid, weight, factor, year from cte) cte
     ) cte
group by categoryid, weight, factor, (year - seqnum)
order by categoryid, min(year)
option (maxrecursion 0);

Here is a db<>fiddle.

I notice a few things about your data.

  • You are using float for some of the values. This is very dangerous, because two values might look the same but really differ by a very small amount. Use the decimal/numeric type instead, so what-you-see-is-what-you-get.
  • The year values are strings, when they should be integers. Use the proper data types!
User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement