I have the following table:
CREATE temp TABLE "t_table" (
usr_id bigint,
address varchar[],
msg_cnt bigint,
usr_cnt bigint,
source varchar[],
last_update timestamp
);
Add Data:
INSERT INTO "t_table"(usr_id, address, msg_cnt, usr_cnt, source, last_update) VALUES (1, '{44.154.48.125,81.134.82.111,95.155.38.120,94.134.88.136}', 10, 3, '{src1,src2}', '2019-10-16 22:16:22.163000');
INSERT INTO "t_table"(usr_id, address, msg_cnt, usr_cnt, source, last_update) VALUES (2, '{44.154.48.125}', 10, 3, '{src1,src3}', '2019-10-16 22:16:22.163000');
INSERT INTO "t_table"(usr_id, address, msg_cnt, usr_cnt, source, last_update) VALUES (3, '{94.134.88.136}', 10, 3, '{src1,src4}', '2019-10-16 22:16:22.163000');
INSERT INTO "t_table"(usr_id, address, msg_cnt, usr_cnt, source, last_update) VALUES (4, '{127.0.0.1}', 10, 3, '{src1,src5}', '2019-10-16 22:16:22.163000');
INSERT INTO "t_table"(usr_id, address, msg_cnt, usr_cnt, source, last_update) VALUES (5, '{127.0.0.1,5.5.5.5}', 10, 3, '{src1,src3}', '2019-10-16 22:16:22.163000');
INSERT INTO "t_table"(usr_id, address, msg_cnt, usr_cnt, source, last_update) VALUES (6, '{1.1.0.9}', 10, 3, '{src1,src2}', '2019-10-16 22:16:22.163000');
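I want to group users who share addresses, either directly or through another user (users 2 and 3 below have no address in common, but each shares one with user 1).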
Expected Results:
| users | address | sum_msg_cnt | sum_usr_cnt | max_last_date | source |
|---------------------------------|-------------------------------------------------------------|--------------|------------------|--------------------------------|-----------------------------|
| {1,2,3} | {44.154.48.125,81.134.82.111,95.155.38.120,94.134.88.136} | 30 | 9 | "2019-10-16 22:16:22.163000" | {src4,src1,src2,src3} |
| {4,5} | {127.0.0.1,5.5.5.5} | 20 | 6 | "2019-10-16 22:16:22.163000" | {src1,src5,src3} |
| {6} | {1.1.0.9} | 10 | 3 | "2019-10-16 22:16:22.163000" | {src1,src2} |
Question:
How do I formulate a SQL query to obtain the expected result?
Much appreciated.
More info:
PostgreSQL 9.5.19
Answer
I don’t know if this is the most efficient method, but I can’t come up with anything better right now.
I assume this will perform terribly on a larger table.
with userlist as (
  -- one row per distinct address, with all users that list it
  select array_agg(t.usr_id) as users,
         a.address
  from t_table t
    left join unnest(t.address) as a(address) on true
  group by a.address
), shared_users as (
  -- for each address, merge every user list that overlaps its own list
  select u.address,
         array(select distinct ul.uid
               from userlist u2, unnest(u2.users) as ul(uid)
               where u.users && u2.users
               order by ul.uid) as users
  from userlist u
)
-- addresses that ended up with the same merged user list form one group
select users, array_agg(distinct address)
from shared_users
group by users;
What does it do?
The first CTE collects, for each address, all users that have it in their address array. The output of the userlist CTE is:
users | address
------+--------------
{1} | 95.155.38.120
{1,3} | 94.134.88.136
{1,2} | 44.154.48.125
{6} | 1.1.0.9
{4,5} | 127.0.0.1
{1} | 81.134.82.111
{5} | 5.5.5.5
Now this can be used to merge those user lists that have at least one user in common. The output of the shared_users CTE is:
address | users
--------------+--------
95.155.38.120 | {1,2,3}
94.134.88.136 | {1,2,3}
44.154.48.125 | {1,2,3}
1.1.0.9 | {6}
127.0.0.1 | {4,5}
81.134.82.111 | {1,2,3}
5.5.5.5 | {4,5}
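A caveat on this step: shared_users merges only user lists that overlap directly, which is enough for the sample data. With a longer chain (say address A used by {1,2}, B by {2,3} and C by {3,4}) the three rows would get different user lists, and the final grouping would split what should be one group. Below is a minimal sketch of a variant that follows such chains to the end. It assumes every user has a non-empty address array (users without one would drop out), and the names edges, closure, components and grp are made up for illustration. It is not tuned for speed; the self-join is quadratic in the number of users.

```sql
with recursive edges as (
  -- two users are linked when their address arrays overlap;
  -- every user with a non-empty array is also linked to itself
  select t1.usr_id as a, t2.usr_id as b
  from t_table t1
    join t_table t2 on t1.address && t2.address
), closure as (
  select a, b from edges
  union  -- UNION (not UNION ALL) drops duplicates, so the recursion terminates
  select c.a, e.b
  from closure c
    join edges e on e.a = c.b
), components as (
  -- label each user with the smallest usr_id it can reach
  select a as usr_id, min(b) as grp
  from closure
  group by a
)
select array_agg(usr_id order by usr_id) as users
from components
group by grp
order by grp;
```

On the sample data this should return the same three groups, {1,2,3}, {4,5} and {6}.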
As the shared_users output shows, we now have groups with the same list of usr_ids. In the final step we can group by those and aggregate the addresses, which will then return:
users | array_agg
--------+----------------------------------------------------------
{1,2,3} | {44.154.48.125,81.134.82.111,94.134.88.136,95.155.38.120}
{4,5} | {127.0.0.1,5.5.5.5}
{6} | {1.1.0.9}
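The query above returns only the users and address columns. The following is a sketch of how the remaining columns from the expected result could be added by joining the groups back to t_table. It assumes usr_id is unique in t_table; the grouped CTE and the aliases g, t2 and s are introduced here for illustration.

```sql
with userlist as (
  select array_agg(t.usr_id) as users,
         a.address
  from t_table t
    left join unnest(t.address) as a(address) on true
  group by a.address
), shared_users as (
  select u.address,
         array(select distinct ul.uid
               from userlist u2, unnest(u2.users) as ul(uid)
               where u.users && u2.users
               order by ul.uid) as users
  from userlist u
), grouped as (
  -- the final step of the original query, kept as a CTE
  select users, array_agg(distinct address) as address
  from shared_users
  group by users
)
select g.users,
       g.address,
       sum(t.msg_cnt)     as sum_msg_cnt,
       sum(t.usr_cnt)     as sum_usr_cnt,
       max(t.last_update) as max_last_date,
       -- distinct sources of all users in the group, sorted
       array(select distinct s.src
             from t_table t2, unnest(t2.source) as s(src)
             where t2.usr_id = any (g.users)
             order by s.src) as source
from grouped g
  join t_table t on t.usr_id = any (g.users)
group by g.users, g.address;
```

The source arrays come back sorted, so their element order differs from the expected table, and sum() over a bigint column returns numeric; cast it back with ::bigint if the type matters.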