How to to create a new identifier in SQL for related addresses that have multiple relationships?

I have a data set like below that includes address and customer_id. In this example multiple customers can ship to the same address and a customer can ship to multiple addresses. I would like to use the relationships between these to create a new ID that joins all of the related addresses and customer_id’s with a new identifier.

Original Table

+-------+-----+
|address| id  |
+-------+-----+
| 11 rd | aa  |
+-------+-----+
| 11 rd | ab  |
+-------+-----+
| 21 dr | ac  |
+-------+-----+
| 21 dr | ab  |
+-------+-----+
| 31 rd | ad  |
+-------+-----+
| 21 dr | abb |
+-------+-----+
| 41 dr | abb |
+-------+-----+

​x
 
+-------+-----+|address| id  |+-------+-----+| 11 rd | aa  |+-------+-----+| 11 rd | ab  |+-------+-----+| 21 dr | ac  |+-------+-----+| 21 dr | ab  |+-------+-----+| 31 rd | ad  |+-------+-----+| 21 dr | abb |+-------+-----+| 41 dr | abb |+-------+-----+​

Desired Output Table

+-------+-----+--------+
|address| id  | new_id |
+-------+-----+--------+
| 11 rd | aa  | 1      |
+-------+-----+--------+
| 11 rd | ab  | 1      |
+-------+-----+--------+
| 21 dr | ac  | 1      |
+-------+-----+--------+
| 21 dr | ab  | 1      |
+-------+-----+--------+
| 31 rd | ad  | 2      |
+-------+-----+--------+
| 21 dr | abb | 1      |
+-------+-----+--------+
| 41 dr | abb | 1      |
+-------+-----+--------+

 
+-------+-----+--------+|address| id  | new_id |+-------+-----+--------+| 11 rd | aa  | 1      |+-------+-----+--------+| 11 rd | ab  | 1      |+-------+-----+--------+| 21 dr | ac  | 1      |+-------+-----+--------+| 21 dr | ab  | 1      |+-------+-----+--------+| 31 rd | ad  | 2      |+-------+-----+--------+| 21 dr | abb | 1      |+-------+-----+--------+| 41 dr | abb | 1      |+-------+-----+--------+​

The challenge is the number of related associations could be infinite and the SQL I have below will only work across two joins.

Here is the SQL I have which works for this small data set but when scaled to even more relationships it would require more joins. Any thoughts on the proper way to structure this would be appreciated!

CREATE TABLE #temp (address char(40), id char(40))
INSERT INTO #temp VALUES ('11 rd', 'aa');
INSERT INTO #temp VALUES ('11 rd', 'ab');
INSERT INTO #temp VALUES ('21 dr', 'ac');
INSERT INTO #temp VALUES ('21 dr', 'ab');
INSERT INTO #temp VALUES ('31 rd', 'ad');
INSERT INTO #temp VALUES ('21 dr', 'abb');
INSERT INTO #temp VALUES ('41 dr', 'abb');

SELECT 
*
,DENSE_RANK() OVER(PARTITION BY id ORDER BY address ASC)as address_rank
,DENSE_RANK() OVER(PARTITION BY address ORDER BY id ASC)as id_rank
INTO #temp2 
FROM #temp 

SELECT a.address,a.id_rank,a.id,b.address as combined_address
INTO #temp3
FROM #temp2 a
LEFT JOIN #temp2 b ON a.id=b.id AND b.address_rank = 1

SELECT 
 a.address
,a.id
,DENSE_RANK() OVER(ORDER BY b.combined_address ASC)as new_id
FROM #temp3 a
LEFT JOIN #temp3 b ON a.combined_address=b.address and b.id_rank = 1

 
CREATE TABLE #temp (address char(40), id char(40))INSERT INTO #temp VALUES ('11 rd', 'aa');INSERT INTO #temp VALUES ('11 rd', 'ab');INSERT INTO #temp VALUES ('21 dr', 'ac');INSERT INTO #temp VALUES ('21 dr', 'ab');INSERT INTO #temp VALUES ('31 rd', 'ad');INSERT INTO #temp VALUES ('21 dr', 'abb');INSERT INTO #temp VALUES ('41 dr', 'abb');​SELECT *,DENSE_RANK() OVER(PARTITION BY id ORDER BY address ASC)as address_rank,DENSE_RANK() OVER(PARTITION BY address ORDER BY id ASC)as id_rankINTO #temp2 FROM #temp ​SELECT a.address,a.id_rank,a.id,b.address as combined_addressINTO #temp3FROM #temp2 aLEFT JOIN #temp2 b ON a.id=b.id AND b.address_rank = 1​SELECT  a.address,a.id,DENSE_RANK() OVER(ORDER BY b.combined_address ASC)as new_idFROM #temp3 aLEFT JOIN #temp3 b ON a.combined_address=b.address and b.id_rank = 1​

Answer

This is a graph-walking problem. I would start by rephrasing the data a bit. It helps for each row to have a unique identifier, which is appropriately called id. Hence, I renamed id as name.

Then, construct the edges and use a recursive CTE to walk through the graph. In SQL Server, this looks like:

with edges as (
      select distinct t1.id as id1, t2.id as id2
      from temp t1 join
           temp t2
           on (t1.address = t2.address or t1.name = t2.name) and (t1.id <> t2.id)
     ),
     cte as (
      select id, id as next_id, convert(varchar(max), concat(',', id, ',')) as visited, 1 as lev
      from temp
      union all
      select cte.id, e.id2, concat(visited, e.id2, ','), lev + 1
      from cte join
           edges e
           on cte.next_id = e.id1
      where visited not like concat('%,', e.id2, ',%') and lev < 10
     )
select id, min(next_id)
from cte
group by id;

 
with edges as (      select distinct t1.id as id1, t2.id as id2      from temp t1 join           temp t2           on (t1.address = t2.address or t1.name = t2.name) and (t1.id <> t2.id)     ),     cte as (      select id, id as next_id, convert(varchar(max), concat(',', id, ',')) as visited, 1 as lev      from temp      union all      select cte.id, e.id2, concat(visited, e.id2, ','), lev + 1      from cte join           edges e           on cte.next_id = e.id1      where visited not like concat('%,', e.id2, ',%') and lev < 10     )select id, min(next_id)from ctegroup by id;​

Here is a db<>fiddle.

Advertisement

Answer