Skip to content

Creating unique row IDs in SQL using partition

I’m hoping that someone can either tell me where my syntax is wrong or let me know if there’s a better way to achieve my goal. I have taken a large dataset and applied a number of identifying fields to later group the data by. I would like to apply a unique ‘Cohort ID’ to the data that I can use to later group the data by. I was thinking that a Row_Number() Over (Partition) type of structure would work but it clearly isn’t for some reason. I am getting Cohort IDs that are different for my specified fields when they should be the same. My code is below:

To further specify what I am trying to do, if all items in my selected fields below are equal for two different rows of data, I want that to be labeled with the same Cohort ID. If the next row is the same as the first two except for a different ‘Issuer’, for example, I want that to get a different Cohort ID.

Data sample below with my expected Cohort ID

Asset_Type  Account Issuer  Return_Bucket   Age_Bucket  Type_Bucket Risk_Bucket Cohort ID
Equity  Client 1    Bank A  10-15   3-6 months  Financial   Moderate    1
Equity  Client 1    Bank A  10-15   3-6 months  Financial   Moderate    1
Equity  Client 1    Bank A  10-15   6-12 months Financial   Moderate    2
Equity  Client 1    Bank A  10-15   6-12 months Financial   Moderate    2
Equity  Client 1    Bank A  10-15   6-12 months Financial   High    3
Fixed   Client 1    Bank A  10-15   6-12 months Financial   High    4
Equity  Client 1    Bank A  5-10    3-6 months  Financial   Moderate    5



Select
Row_Number() Over(Partition By [Asset_Type],
[Account],
[Issuer],
[Return_Bucket],
[Age_Bucket],
[Type_Bucket],
[Risk_Bucket]
Order by [Account]) Account AS CohortID
into Test_Table_2
from Test_Table_1

Advertisement

Answer

Your sample results suggest that you want dense_rank() — no gaps:

Select t.*
       dense_rank() Over (order by [Asset_Type], [Account], [Issuer], [Return_Bucket],
                                   [Age_Bucket], [Type_Bucket], [Risk_Bucket]
                         ) as CohortID
into Test_Table_2
from Test_Table_1 t;

The partition by splits the counting into different groups. That means that 1 will be repeated in each group. The order by simply orders them.

I assume that you have other columns as well. Duplicates in tables is generally not very useful. I would recommend instead that you remove them.

User contributions licensed under: CC BY-SA
10 People found this is helpful