I have a data set like below that includes address and customer_id. In this example multiple customers can ship to the same address and a customer can ship to multiple addresses. I would like to use the relationships between these to create a new ID that joins all of the related addresses and customer_id’s with a new identifier.
Original Table
+-------+-----+
|address| id |
+-------+-----+
| 11 rd | aa |
+-------+-----+
| 11 rd | ab |
+-------+-----+
| 21 dr | ac |
+-------+-----+
| 21 dr | ab |
+-------+-----+
| 31 rd | ad |
+-------+-----+
| 21 dr | abb |
+-------+-----+
| 41 dr | abb |
+-------+-----+
Desired Output Table
+-------+-----+--------+
|address| id | new_id |
+-------+-----+--------+
| 11 rd | aa | 1 |
+-------+-----+--------+
| 11 rd | ab | 1 |
+-------+-----+--------+
| 21 dr | ac | 1 |
+-------+-----+--------+
| 21 dr | ab | 1 |
+-------+-----+--------+
| 31 rd | ad | 2 |
+-------+-----+--------+
| 21 dr | abb | 1 |
+-------+-----+--------+
| 41 dr | abb | 1 |
+-------+-----+--------+
The challenge is the number of related associations could be infinite and the SQL I have below will only work across two joins.
Here is the SQL I have which works for this small data set but when scaled to even more relationships it would require more joins. Any thoughts on the proper way to structure this would be appreciated!
CREATE TABLE #temp (address char(40), id char(40))
INSERT INTO #temp VALUES ('11 rd', 'aa');
INSERT INTO #temp VALUES ('11 rd', 'ab');
INSERT INTO #temp VALUES ('21 dr', 'ac');
INSERT INTO #temp VALUES ('21 dr', 'ab');
INSERT INTO #temp VALUES ('31 rd', 'ad');
INSERT INTO #temp VALUES ('21 dr', 'abb');
INSERT INTO #temp VALUES ('41 dr', 'abb');
SELECT
*
,DENSE_RANK() OVER(PARTITION BY id ORDER BY address ASC)as address_rank
,DENSE_RANK() OVER(PARTITION BY address ORDER BY id ASC)as id_rank
INTO #temp2
FROM #temp
SELECT a.address,a.id_rank,a.id,b.address as combined_address
INTO #temp3
FROM #temp2 a
LEFT JOIN #temp2 b ON a.id=b.id AND b.address_rank = 1
SELECT
a.address
,a.id
,DENSE_RANK() OVER(ORDER BY b.combined_address ASC)as new_id
FROM #temp3 a
LEFT JOIN #temp3 b ON a.combined_address=b.address and b.id_rank = 1
Advertisement
Answer
This is a graph-walking problem. I would start by rephrasing the data a bit. It helps for each row to have a unique identifier, which is appropriately called id
. Hence, I renamed id
as name
.
Then, construct the edges and use a recursive CTE to walk through the graph. In SQL Server, this looks like:
with edges as (
select distinct t1.id as id1, t2.id as id2
from temp t1 join
temp t2
on (t1.address = t2.address or t1.name = t2.name) and (t1.id <> t2.id)
),
cte as (
select id, id as next_id, convert(varchar(max), concat(',', id, ',')) as visited, 1 as lev
from temp
union all
select cte.id, e.id2, concat(visited, e.id2, ','), lev + 1
from cte join
edges e
on cte.next_id = e.id1
where visited not like concat('%,', e.id2, ',%') and lev < 10
)
select id, min(next_id)
from cte
group by id;
Here is a db<>fiddle.