I have the following table:
CREATE temp TABLE "t_table" ( usr_id bigint, address varchar[], msg_cnt bigint, usr_cnt bigint, source varchar[], last_update timestamp );
Add Data:
INSERT INTO "t_table"(usr_id, address, msg_cnt, usr_cnt, source, last_update) VALUES (1, '{44.154.48.125,81.134.82.111,95.155.38.120,94.134.88.136}', 10, 3, '{src1,src2}', '2019-10-16 22:16:22.163000'); INSERT INTO "t_table"(usr_id, address, msg_cnt, usr_cnt, source, last_update) VALUES (2, '{44.154.48.125}', 10, 3, '{src1,src3}', '2019-10-16 22:16:22.163000'); INSERT INTO "t_table"(usr_id, address, msg_cnt, usr_cnt, source, last_update) VALUES (3, '{94.134.88.136}', 10, 3, '{src1,src4}', '2019-10-16 22:16:22.163000'); INSERT INTO "t_table"(usr_id, address, msg_cnt, usr_cnt, source, last_update) VALUES (4, '{127.0.0.1}', 10, 3, '{src1,src5}', '2019-10-16 22:16:22.163000'); INSERT INTO "t_table"(usr_id, address, msg_cnt, usr_cnt, source, last_update) VALUES (5, '{127.0.0.1,5.5.5.5}', 10, 3, '{src1,src3}', '2019-10-16 22:16:22.163000'); INSERT INTO "t_table"(usr_id, address, msg_cnt, usr_cnt, source, last_update) VALUES (6, '{1.1.0.9}', 10, 3, '{src1,src2}', '2019-10-16 22:16:22.163000');
Find users who share addresses.
Expected Results:
| users | address | sum_msg_cnt | sum_usr_cnt | max_last_date | source | |---------------------------------|-------------------------------------------------------------|--------------|------------------|--------------------------------|-----------------------------| | {1,2,3} | {44.154.48.125,81.134.82.111,95.155.38.120,94.134.88.136} | 30 | 9 | "2019-10-16 22:16:22.163000" | {src4,src1,src2,src3} | | {4,5} | {127.0.0.1,5.5.5.5} | 20 | 6 | "2019-10-16 22:16:22.163000" | {src1,src5,src3} | | {6} | {1.1.0.9} | 10 | 3 | "2019-10-16 22:16:22.163000" | {src1,src2} |
Question:
How do I formulate a SQL query to obtain the expected result?
Much appreciated.
More info:
PostgreSQL 9.5.19
Advertisement
Answer
I don’t know if this is the most efficient method, but I can’t come up with something better right now.
I assume this will have a terrible performance on a larger table.
with userlist as ( select array_agg(t.usr_id) as users, a.address from t_table t left join unnest(t.address) as a(address) on true group by a.address ), shared_users as ( select u.address, array(select distinct ul.uid from userlist u2, unnest(u2.users) as ul(uid) where u.users && u2.users order by ul.uid) as users from userlist u ) select users, array_agg(distinct address) from shared_users group by users;
What does it do?
The first CTE collects all users that share at least one address. The output of the userlist
CTE is:
users | address ------+-------------- {1} | 95.155.38.120 {1,3} | 94.134.88.136 {1,2} | 44.154.48.125 {6} | 1.1.0.9 {4,5} | 127.0.0.1 {1} | 81.134.82.111 {5} | 5.5.5.5
Now this can be used to aggregate those user lists that share at least one address. The output of the shared_users
CTE is:
address | users --------------+-------- 95.155.38.120 | {1,2,3} 94.134.88.136 | {1,2,3} 44.154.48.125 | {1,2,3} 1.1.0.9 | {6} 127.0.0.1 | {4,5} 81.134.82.111 | {1,2,3} 5.5.5.5 | {4,5}
As you can see we now have groups with the same list of usr_ids. In the final step we can group by those and aggregate the addresses, which will then return:
users | array_agg --------+---------------------------------------------------------- {1,2,3} | {44.154.48.125,81.134.82.111,94.134.88.136,95.155.38.120} {4,5} | {127.0.0.1,5.5.5.5} {6} | {1.1.0.9}