I have a data set like below that includes address and customer_id. In this example multiple customers can ship to the same address and a customer can ship to multiple addresses. I would like to use the relationships between these to create a new ID that joins all of the related addresses and customer_id’s with a new identifier.
Original Table
+-------+-----+ |address| id | +-------+-----+ | 11 rd | aa | +-------+-----+ | 11 rd | ab | +-------+-----+ | 21 dr | ac | +-------+-----+ | 21 dr | ab | +-------+-----+ | 31 rd | ad | +-------+-----+ | 21 dr | abb | +-------+-----+ | 41 dr | abb | +-------+-----+
Desired Output Table
+-------+-----+--------+ |address| id | new_id | +-------+-----+--------+ | 11 rd | aa | 1 | +-------+-----+--------+ | 11 rd | ab | 1 | +-------+-----+--------+ | 21 dr | ac | 1 | +-------+-----+--------+ | 21 dr | ab | 1 | +-------+-----+--------+ | 31 rd | ad | 2 | +-------+-----+--------+ | 21 dr | abb | 1 | +-------+-----+--------+ | 41 dr | abb | 1 | +-------+-----+--------+
The challenge is the number of related associations could be infinite and the SQL I have below will only work across two joins.
Here is the SQL I have which works for this small data set but when scaled to even more relationships it would require more joins. Any thoughts on the proper way to structure this would be appreciated!
CREATE TABLE #temp (address char(40), id char(40)) INSERT INTO #temp VALUES ('11 rd', 'aa'); INSERT INTO #temp VALUES ('11 rd', 'ab'); INSERT INTO #temp VALUES ('21 dr', 'ac'); INSERT INTO #temp VALUES ('21 dr', 'ab'); INSERT INTO #temp VALUES ('31 rd', 'ad'); INSERT INTO #temp VALUES ('21 dr', 'abb'); INSERT INTO #temp VALUES ('41 dr', 'abb'); SELECT * ,DENSE_RANK() OVER(PARTITION BY id ORDER BY address ASC)as address_rank ,DENSE_RANK() OVER(PARTITION BY address ORDER BY id ASC)as id_rank INTO #temp2 FROM #temp SELECT a.address,a.id_rank,a.id,b.address as combined_address INTO #temp3 FROM #temp2 a LEFT JOIN #temp2 b ON a.id=b.id AND b.address_rank = 1 SELECT a.address ,a.id ,DENSE_RANK() OVER(ORDER BY b.combined_address ASC)as new_id FROM #temp3 a LEFT JOIN #temp3 b ON a.combined_address=b.address and b.id_rank = 1
Advertisement
Answer
This is a graph-walking problem. I would start by rephrasing the data a bit. It helps for each row to have a unique identifier, which is appropriately called id
. Hence, I renamed id
as name
.
Then, construct the edges and use a recursive CTE to walk through the graph. In SQL Server, this looks like:
with edges as ( select distinct t1.id as id1, t2.id as id2 from temp t1 join temp t2 on (t1.address = t2.address or t1.name = t2.name) and (t1.id <> t2.id) ), cte as ( select id, id as next_id, convert(varchar(max), concat(',', id, ',')) as visited, 1 as lev from temp union all select cte.id, e.id2, concat(visited, e.id2, ','), lev + 1 from cte join edges e on cte.next_id = e.id1 where visited not like concat('%,', e.id2, ',%') and lev < 10 ) select id, min(next_id) from cte group by id;
Here is a db<>fiddle.