Skip to content
Advertisement

Creating unique row IDs in SQL using partition

I’m hoping that someone can either tell me where my syntax is wrong or let me know if there’s a better way to achieve my goal. I have taken a large dataset and applied a number of identifying fields to later group the data by. I would like to apply a unique ‘Cohort ID’ to the data that I can use to later group the data by. I was thinking that a Row_Number() Over (Partition) type of structure would work but it clearly isn’t for some reason. I am getting Cohort IDs that are different for my specified fields when they should be the same. My code is below:

To further specify what I am trying to do, if all items in my selected fields below are equal for two different rows of data, I want that to be labeled with the same Cohort ID. If the next row is the same as the first two except for a different ‘Issuer’, for example, I want that to get a different Cohort ID.

Data sample below with my expected Cohort ID

Asset_Type  Account Issuer  Return_Bucket   Age_Bucket  Type_Bucket Risk_Bucket Cohort ID
Equity  Client 1    Bank A  10-15   3-6 months  Financial   Moderate    1
Equity  Client 1    Bank A  10-15   3-6 months  Financial   Moderate    1
Equity  Client 1    Bank A  10-15   6-12 months Financial   Moderate    2
Equity  Client 1    Bank A  10-15   6-12 months Financial   Moderate    2
Equity  Client 1    Bank A  10-15   6-12 months Financial   High    3
Fixed   Client 1    Bank A  10-15   6-12 months Financial   High    4
Equity  Client 1    Bank A  5-10    3-6 months  Financial   Moderate    5



Select
Row_Number() Over(Partition By [Asset_Type],
[Account],
[Issuer],
[Return_Bucket],
[Age_Bucket],
[Type_Bucket],
[Risk_Bucket]
Order by [Account]) Account AS CohortID
into Test_Table_2
from Test_Table_1

Advertisement

Answer

Your sample results suggest that you want dense_rank() — no gaps:

Select t.*
       dense_rank() Over (order by [Asset_Type], [Account], [Issuer], [Return_Bucket],
                                   [Age_Bucket], [Type_Bucket], [Risk_Bucket]
                         ) as CohortID
into Test_Table_2
from Test_Table_1 t;

The partition by splits the counting into different groups. That means that 1 will be repeated in each group. The order by simply orders them.

I assume that you have other columns as well. Duplicates in tables is generally not very useful. I would recommend instead that you remove them.

10 People found this is helpful
Advertisement