I have an interesting issue where I need to create a unique identifier based on match groups for a set of data. This is is based on multiple criteria, but generally what I need to happen is to take this input:
SOURCE_ID | MATCH_ID | PHONE |
---|---|---|
1 | 1 | (999)9999999 |
1 | 2 | (999)9999999 |
2 | 1 | (999)9999999 |
213710 | 707187 | (001)2548987 |
213710 | 759263 | (100)8348243 |
213705 | 2416730 | (156)6676200 |
213705 | 12116102 | (132)3453523 |
And it needs to look like this as the output:
SOURCE_ID | MATCH_ID | PHONE | GENERATED_ID |
---|---|---|---|
1 | 1 | (999)9999999 | 1 |
1 | 2 | (999)9999999 | 1 |
2 | 1 | (999)9999999 | 1 |
213710 | 707187 | (001)2548987 | 2 |
213710 | 759263 | (100)8348243 | 2 |
213705 | 2416730 | (156)6676200 | 3 |
213705 | 12116102 | (132)3453523 | 3 |
I’ve utilized the DENSE_RANK() function to create two separate IDs, one sorting on PHONE, the other SOURCEID column. The PHONE sort gives me the correct output for lines 1-3, but incorrect for 4-7, while the SOURCE_ID sort works on lines 4-6, but not 1-3.
How can I combine these in a way that gives the desired output above? I’ve tried combining the columns in every format possible, but no luck there either.
Output from testing, highlighted correct results. Each TEST## column is noted below
The SQL for reference:
SELECT SOURCE_ID, MATCH_ID, PHONE, DENSE_RANK() OVER(ORDER BY PHONE) AS PHONE_SORT DENSE_RANK() OVER(ORDER BY SOURCE_ID) AS SOURCE_ID_SORT DENSE_RANK() OVER(ORDER BY MATCH_ID, INTERNAL_ROW_ID) AS TEST1, DENSE_RANK() OVER(ORDER BY SOURCE_ID, MATCH_ID) AS TEST2, DENSE_RANK() OVER(ORDER BY MATCH_ID) AS TEST3, DENSE_RANK() OVER(ORDER BY MATCH_ID, SOURCE_ID, PHONE) AS TEST4, DENSE_RANK() OVER(ORDER BY MATCH_ID, PHONE, SOURCE_ID) AS TEST5, DENSE_RANK() OVER(ORDER BY SOURCE_ID, MATCH_ID, PHONE) AS TEST6, DENSE_RANK() OVER(ORDER BY SOURCE_ID, PHONE, MATCH_ID) AS TEST7, DENSE_RANK() OVER(ORDER BY PHONE, SOURCE_ID, MATCH_ID) AS TEST8, DENSE_RANK() OVER(ORDER BY PHONE, MATCH_ID, SOURCE_ID) AS TEST9, DENSE_RANK() OVER(ORDER BY PHONE, SOURCE_ID) AS TEST10, DENSE_RANK() OVER(ORDER BY PHONE, MATCH_ID) AS TEST11, DENSE_RANK() OVER(ORDER BY SOURCE_ID, PHONE) AS TEST12, DENSE_RANK() OVER(ORDER BY MATCH_ID, PHONE) AS TEST13 FROM MY_TABLE;
TIA!
Update — this is working somewhat as expected, but when bringing in additional records (this was a very small subset of the full dataset), starting to run into additional scenarios. Now, trying to make a larger dataset look closer to this:
I used the following code to get almost there but struggling to reassociate those last few records appropriately:
SELECT SOURCE_ID, MATCH_ID, PHONE, DENSE_RANK() OVER(ORDER BY RANKABLE_MATCH_ID) AS GENERATED_ID FROM ( SELECT SOURCE_ID, MATCH_ID, PHONE, COUNT(MATCH_ID) OVER (PARTITION BY MATCH_ID) C_MATCH_ID, IFF(C_PHONE >= C_SOURCE_BY_MATCH AND C_MATCH_ID = C_PHONE, SOURCE_ID::TEXT, RANKABLE_INTERNAL_PHONE) AS RANKABLE_MATCH_ID FROM ( SELECT SOURCE_ID, MATCH_ID, PHONE, COUNT(SOURCE_ID) OVER (PARTITION BY SOURCE_ID) C_SOURCE_ID, COUNT(PHONE) OVER (PARTITION BY PHONE) C_PHONE, COUNT(DISTINCT SOURCE_ID) OVER(PARTITION BY MATCH_ID) C_SOURCE_BY_MATCH, IFF(C_SOURCE_ID > C_PHONE, SOURCE_ID::TEXT, PHONE) AS RANKABLE_INTERNAL_PHONE FROM MY_TABLE ) )
My output based on above code My output based on above code
Advertisement
Answer
that’s some rather obtuse logic.
SELECT column1 ,column2 ,column3 ,dense_rank() over (order by rankable) FROM ( SELECT * ,count(column1) over (partition by column1) c_c1 ,count(column3) over (partition by column3) c_c3 ,iff(c_c1> c_c3, column1::text, column3) as rankable FROM VALUES (1,1,'(999)9999999'), (1, 2,'(999)9999999'), (2, 1,'(999)9999999'), (213710, 707187,'(001)2548987'), (213710, 759263,'(100)8348243'), (213705, 2416730,'(156)6676200'), (213705, 12116102,'(132)3453523') )
gives:
COLUMN1 | COLUMN2 | COLUMN3 | DENSE_RANK() OVER (ORDER BY RANKABLE) |
---|---|---|---|
1 | 1 | (999)9999999 | 1 |
1 | 2 | (999)9999999 | 1 |
2 | 1 | (999)9999999 | 1 |
213,705 | 2,416,730 | (156)6676200 | 2 |
213,705 | 12,116,102 | (132)3453523 | 2 |
213,710 | 707,187 | (001)2548987 | 3 |
213,710 | 759,263 | (100)8348243 | 3 |
A More Complex answer:
So your extended problem, shows you are actually clustering on SETS thus for any SOURCE_ID
all PHONE
‘s are part of the same set, and thus all SOURCE_ID's that are part of the
PHONE`’s set are also in the group. This really should be solved with a recursive CTE to allow for more that 2 steps relationship. Here is a solution that handles 2 layers of chaining..
WITH data AS ( SELECT * FROM VALUES (2, '(999)9999999'), (1, '(999)9999999'), (1, '(999)9999999'), (2, '(999)9999999'), (213705, 'AAA'), (213705, 'AAB'), (213705, 'AAC'), (9624765, 'AAA'), (9624765, 'AAB'), (9624765, 'AAC'), (2175594867, 'AAA'), (2175594867, 'AAB'), (213710, 'BAA'), (213710, 'BAB'), (9213710, 'BAA'), (9213710, 'BAB'), (89213710, 'BAA'), (89213710, 'BAB') ), col1 as ( select column1 ,array_agg(DISTINCT column2) as col2_array from data group by 1 ), col2 as ( select *, row_number() over (order by true) as rn FROM ( select col2_array ,array_agg(DISTINCT column1) as col1_array from col1 group by 1 ) ) SELECT d.column1, d.column2, r.rn FROM data as d JOIN col2 as r on array_contains(d.column1::variant, r.col1_array) and array_contains(d.column2::variant, r.col2_array) ORDER BY 3;
COLUMN1 | COLUMN2 | RN |
---|---|---|
89,213,710 | BAB | 1 |
89,213,710 | BAA | 1 |
9,213,710 | BAB | 1 |
9,213,710 | BAA | 1 |
213,710 | BAB | 1 |
213,710 | BAA | 1 |
1 | (999)9999999 | 2 |
2 | (999)9999999 | 2 |
1 | (999)9999999 | 2 |
2 | (999)9999999 | 2 |
2,175,594,867 | AAA | 3 |
2,175,594,867 | AAB | 3 |
9,624,765 | AAA | 4 |
9,624,765 | AAB | 4 |
9,624,765 | AAC | 4 |
213,705 | AAC | 4 |
213,705 | AAB | 4 |
213,705 | AAA | 4 |