
Split records into buckets based on counts in reference table

This is a simplified version, stripped down to my core problem. I have a ContactData table with millions of rows, where each contact record is assigned to a category by a ReferenceID. I now have to assign a new UpdatedValue to each contact record, based on counts from a separate NewValues table that is also keyed by ReferenceID. It doesn't matter which records are assigned to each group (random or otherwise), as long as the count of records per group is correct.

To use an example from the data below to illustrate: if there are 800 records in #ContactData with ReferenceID=1, then using the RecordTotal counts in #NewValues, I want to assign 200 to Group1, 350 to Group2, and 250 to Group3.

I could do this with nested loops and updates, but the data in NewValues will change regularly, and the resulting assignment of Groups to Contacts will change with it. Also, the updated contact data will be dumped into a separate third table rather than written back to the original ContactData table. So I'm hoping there is an easier way to assign this value on the fly while selecting the data into the third table. Below are sample tables and data to illustrate. Any help would be greatly appreciated.

DROP TABLE IF EXISTS #ContactData 
CREATE TABLE #ContactData  ( 
    RowId INT IDENTITY(1,1) NOT NULL, 
    ReferenceID INT,
    FirstName VARCHAR(10)) 
GO
INSERT INTO #ContactData (ReferenceID,FirstName)
VALUES (1,'John'), (1,'Mary'), (1,'Dan'), (2,'Sue'), (2,'Harvey'), (3,'Frank'), (3,'Mike')
GO
DROP TABLE IF EXISTS #NewValues 
CREATE TABLE #NewValues  ( 
    RowId INT IDENTITY(1,1) NOT NULL, 
    ReferenceID INT, 
    RecordTotal DECIMAL(10,4),
    UpdatedValue NVARCHAR(20)
    ) 
GO
INSERT INTO #NewValues (ReferenceID,RecordTotal,UpdatedValue)
VALUES (1,200,'Group1'), (1,350,'Group2'), (1,250,'Group3'), (2,500,'Group4'), (2,300,'Group5'), (3,150,'Group6'), (3,850,'Group7')
GO
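
For reference, the target group sizes per ReferenceID are the sums of RecordTotal (800 for ReferenceID=1, matching the example above):

select ReferenceID, sum(RecordTotal) as TargetTotal
from #NewValues
group by ReferenceID
order by ReferenceID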


Answer

Avoid using loops for tasks like this; window functions are a much better fit for this kind of problem.

0 – I’ve created a sample data set using your code in a loop:

declare @i int = 1
while (@i < 200)
begin

    -- repeat the 7 sample rows; together with the original insert this gives 200 copies
    INSERT INTO #ContactData (ReferenceID,FirstName)
    VALUES (1,'John'), (1,'Mary'), (1,'Dan'), (2,'Sue'), (2,'Harvey'), (3,'Frank'), (3,'Mike')

    set @i = @i + 1
end
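
As an optional sanity check (assuming the question's 7-row insert ran once before the loop, for 200 copies in total), this shows the generated volumes: 600 rows for ReferenceID 1 and 400 each for ReferenceIDs 2 and 3:

select ReferenceID, count(*) as ContactCount
from #ContactData
group by ReferenceID
order by ReferenceID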

1 – Calculate the bucket boundaries for each ReferenceID as running totals of RecordTotal; minValue is the exclusive lower bound of a bucket and maxValue its inclusive upper bound:

select
    f1.*
,   sum(RecordTotal) over(partition by ReferenceID order by RowId asc) - RecordTotal minValue  -- running total of all previous buckets
,   sum(RecordTotal) over(partition by ReferenceID order by RowId asc) maxValue                -- running total including this bucket
from #NewValues f1
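
For the sample data, the boundaries for ReferenceID = 1 work out as:

ReferenceID  UpdatedValue  RecordTotal  minValue  maxValue
1            Group1        200          0         200
1            Group2        350          200       550
1            Group3        250          550       800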

2 – Then number the records within each ReferenceID, ordering by any column:

select
    *
,   sum(1) over(partition by ReferenceID order by RowId asc) rn    -- 1-based position of the record within its ReferenceID
from #ContactData f1
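
Because RowId is unique, this running SUM(1) is equivalent to a row number, so the same rn can also be written as:

select
    *
,   row_number() over(partition by ReferenceID order by RowId asc) rn
from #ContactData f1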

3 – Using the rn calculated in Step 2, you can assign your records to buckets dynamically. Since rn starts at 1, a record falls into the bucket where minValue < rn <= maxValue. Here is the complete code:

select
    g1.*
,   g2.UpdatedValue as Bucket
from 
    (
    select
        *
    ,   sum(1) over(partition by ReferenceID order by RowId asc) rn    
    from #ContactData f1
    ) g1
inner join 
    (
    select
        f1.*
        ,   sum(RecordTotal) over(partition by ReferenceID order by RowId asc) - RecordTotal minValue
        ,   sum(RecordTotal) over(partition by ReferenceID order by RowId asc)  maxValue
    from #NewValues f1
    ) g2 on g1.ReferenceID = g2.ReferenceID and g1.rn > g2.minValue and g1.rn <= g2.maxValue
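
Since the question asks for the result to land in a separate third table, the same query can feed a SELECT ... INTO directly. A minimal sketch, assuming a hypothetical target table named #ContactAssignments, with a count check per bucket at the end:

drop table if exists #ContactAssignments

select
    g1.RowId
,   g1.ReferenceID
,   g1.FirstName
,   g2.UpdatedValue as Bucket
into #ContactAssignments
from
    (
    select
        *
    ,   sum(1) over(partition by ReferenceID order by RowId asc) rn
    from #ContactData f1
    ) g1
inner join
    (
    select
        f1.*
    ,   sum(RecordTotal) over(partition by ReferenceID order by RowId asc) - RecordTotal minValue
    ,   sum(RecordTotal) over(partition by ReferenceID order by RowId asc) maxValue
    from #NewValues f1
    ) g2 on g1.ReferenceID = g2.ReferenceID and g1.rn > g2.minValue and g1.rn <= g2.maxValue

-- verify the per-bucket counts
select ReferenceID, Bucket, count(*) as AssignedCount
from #ContactAssignments
group by ReferenceID, Bucket
order by ReferenceID, Bucket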