I’m hoping that someone can either tell me where my syntax is wrong or let me know if there’s a better way to achieve my goal. I have taken a large dataset and applied a number of identifying fields to later group the data by. I would like to apply a unique ‘Cohort ID’ to the data that I can use to later group the data by. I was thinking that a Row_Number() Over (Partition) type of structure would work but it clearly isn’t for some reason. I am getting Cohort IDs that are different for my specified fields when they should be the same. My code is below:
To further specify what I am trying to do, if all items in my selected fields below are equal for two different rows of data, I want that to be labeled with the same Cohort ID. If the next row is the same as the first two except for a different ‘Issuer’, for example, I want that to get a different Cohort ID.
Data sample below with my expected Cohort ID
Asset_Type Account Issuer Return_Bucket Age_Bucket Type_Bucket Risk_Bucket Cohort ID Equity Client 1 Bank A 10-15 3-6 months Financial Moderate 1 Equity Client 1 Bank A 10-15 3-6 months Financial Moderate 1 Equity Client 1 Bank A 10-15 6-12 months Financial Moderate 2 Equity Client 1 Bank A 10-15 6-12 months Financial Moderate 2 Equity Client 1 Bank A 10-15 6-12 months Financial High 3 Fixed Client 1 Bank A 10-15 6-12 months Financial High 4 Equity Client 1 Bank A 5-10 3-6 months Financial Moderate 5 Select Row_Number() Over(Partition By [Asset_Type], [Account], [Issuer], [Return_Bucket], [Age_Bucket], [Type_Bucket], [Risk_Bucket] Order by [Account]) Account AS CohortID into Test_Table_2 from Test_Table_1
Advertisement
Answer
Your sample results suggest that you want dense_rank()
— no gaps:
Select t.* dense_rank() Over (order by [Asset_Type], [Account], [Issuer], [Return_Bucket], [Age_Bucket], [Type_Bucket], [Risk_Bucket] ) as CohortID into Test_Table_2 from Test_Table_1 t;
The partition by
splits the counting into different groups. That means that 1
will be repeated in each group. The order by
simply orders them.
I assume that you have other columns as well. Duplicates in tables is generally not very useful. I would recommend instead that you remove them.