I have two tables, each with a user_id column and a group_name column.
E.g.
table1:

| user_id | group_name1 |
-------------------------
| 1       | 'groupA'    |
| 1       | 'groupB'    |
| 2       | 'groupA'    |
| 1       | 'groupA'    |
-------------------------

table2:

| user_id | group_name2 |
-------------------------
| 1       | 'groupL'    |
| 2       | 'groupL'    |
| 3       | 'groupL'    |
| 4       | 'groupN'    |
| 1       | 'groupN'    |
| 3       | 'groupN'    |
-------------------------
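For each pair of groups (group_name1, group_name2), I’m trying to build a distribution of how many times each user in table2 shows up in table1: that is, for every value of times_show_up, the number of users in group_name2 who appear that many times in group_name1.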
For the example above, I’d get
| times_show_up | number_of_users | group_name1 | group_name2 |
---------------------------------------------------------------
| 0             | 1               | groupA      | groupL      |
| 1             | 1               | groupA      | groupL      |
| 2             | 1               | groupA      | groupL      |
| 0             | 2               | groupB      | groupL      |
| 1             | 1               | groupB      | groupL      |
| 2             | 0               | groupB      | groupL      |
| 0             | 2               | groupA      | groupN      |
| 1             | 0               | groupA      | groupN      |
| 2             | 1               | groupA      | groupN      |
| 0             | 2               | groupB      | groupN      |
| 1             | 1               | groupB      | groupN      |
| 2             | 0               | groupB      | groupN      |
---------------------------------------------------------------
To explain a few rows as a further example:
For row 1, the user with user_id = 3 in groupL shows up 0 times in groupA. For row 2, the user with user_id = 2 in groupL shows up once in groupA. And for row 3, the user with user_id = 1 in groupL shows up twice in groupA.
The other group combinations work the same way, assuming I filled everything out correctly.
Although in this example a user shows up at most 2 times, in the real data the maximum is some arbitrarily large number that I don’t know ahead of time.
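To pin down the target unambiguously, here is a minimal Python sketch that computes the same distribution from the sample rows. It is only a reference for the expected result on toy data, not something that would scale to the real tables; the variable names (`appearances`, `members2`, `distribution`) are made up for illustration. Combinations with zero users are simply absent from the output.

```python
from collections import Counter

# Sample rows from the tables above: (user_id, group_name)
table1 = [(1, "groupA"), (1, "groupB"), (2, "groupA"), (1, "groupA")]
table2 = [(1, "groupL"), (2, "groupL"), (3, "groupL"),
          (4, "groupN"), (1, "groupN"), (3, "groupN")]

# How many times each (user_id, group_name1) pair occurs in table1
appearances = Counter(table1)
groups1 = sorted({g1 for _, g1 in table1})
# Distinct (user_id, group_name2) memberships from table2
members2 = sorted(set(table2))

# (times_show_up, group_name1, group_name2) -> number_of_users
distribution = Counter()
for user_id, g2 in members2:
    for g1 in groups1:
        # A Counter returns 0 for pairs that never occur in table1
        times = appearances[(user_id, g1)]
        distribution[(times, g1, g2)] += 1

for (times, g1, g2), n in sorted(distribution.items(),
                                 key=lambda kv: (kv[0][2], kv[0][1], kv[0][0])):
    print(times, n, g1, g2)
```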
I’ve come up with a query that can do all of this except counting the 0s, that looks like this:
    SELECT
      COUNT(user_id) AS num_users,
      times_show_up,
      group_name1,
      group_name2
    FROM (
      SELECT
        user_id,
        COUNT(*) AS times_show_up,
        group_name1,
        group_name2
      FROM table1
      RIGHT JOIN (SELECT DISTINCT user_id, group_name2 FROM table2)
      USING(user_id)
      GROUP BY user_id, group_name1, group_name2
    )
    GROUP BY times_show_up, group_name1, group_name2
Unfortunately this won’t return the 0 counts in the times_show_up column, and I haven’t figured out a solution that can accomplish this without a lot of subqueries. One possible method is to run subqueries that grab the 0s for every combination of groups, and then UNION those rows to the rest of the table. But I want to avoid approaches that need a subquery for each possible group1, group2 combination, since the number of groups is very large.
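The reason the zeros go missing can be seen by running the inner query on the sample data. The sketch below uses Python's built-in sqlite3 module rather than BigQuery, and rewrites the RIGHT JOIN as the equivalent LEFT JOIN with the tables swapped (an assumption made only so it runs on older SQLite versions). Users with no match in table1 come back with a NULL group_name1 and a COUNT(*) of 1, because the single NULL-extended row is still counted, so they are never attributed to groupA or groupB with a count of 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (user_id INTEGER, group_name1 TEXT);
CREATE TABLE table2 (user_id INTEGER, group_name2 TEXT);
INSERT INTO table1 VALUES (1,'groupA'),(1,'groupB'),(2,'groupA'),(1,'groupA');
INSERT INTO table2 VALUES (1,'groupL'),(2,'groupL'),(3,'groupL'),
                          (4,'groupN'),(1,'groupN'),(3,'groupN');
""")

# Inner query from the question, with RIGHT JOIN expressed as a LEFT JOIN
rows = conn.execute("""
SELECT t2.user_id, COUNT(*) AS times_show_up, t1.group_name1, t2.group_name2
FROM (SELECT DISTINCT user_id, group_name2 FROM table2) AS t2
LEFT JOIN table1 AS t1 ON t1.user_id = t2.user_id
GROUP BY t2.user_id, t1.group_name1, t2.group_name2
ORDER BY t2.user_id, t2.group_name2, t1.group_name1
""").fetchall()

for row in rows:
    print(row)  # unmatched users show group_name1 = None and a count of 1
```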
One limitation is that PARTITION BY tends to run out of memory on this data set, so I want to avoid it. Updated requirement: in addition, using CROSS JOIN at the individual user level (cross joining table1 directly to table2 without grouping rows first) doesn’t work, because each table has 10s of millions of rows.
Finally, rows with 0s in the number_of_users column don’t have to show up. It’s fine if they do, since they can be removed with a simple WHERE, but they are not necessary if leaving them out helps with the query.
UPDATE:
I was able to come up with a query that generates the zeros while requiring only one subquery per group_name1, rather than one per (group_name1, group_name2) combination. I’m adding it to the question in case it helps someone come up with an answer that needs even fewer subqueries, since table1 could still have upwards of 20 groups, which means 20+ subqueries combined via UNION ALL.
    SELECT * FROM (
      SELECT
        times_show_up,
        COUNT(user_id) AS num_users,
        group_name1,
        group_name2
      FROM (
        SELECT
          user_id,
          COUNT(*) AS times_show_up,
          group_name1,
          group_name2
        FROM table1
        INNER JOIN (SELECT DISTINCT user_id, group_name2 FROM table2) t2
        USING(user_id)
        GROUP BY user_id, group_name1, group_name2
      ) t1
      GROUP BY times_show_up, group_name1, group_name2
    ) t9
    UNION ALL (
      SELECT
        0 AS times_show_up,
        SUM(CASE WHEN t1.user_id IS NULL THEN 1 ELSE 0 END) AS num_users,
        'groupA' AS group_name1,
        group_name2
      FROM table2
      LEFT JOIN (SELECT user_id FROM table1 WHERE group_name1 = 'groupA') t1
      USING(user_id)
      GROUP BY group_name2
    )
    UNION ALL (
      SELECT
        0 AS times_show_up,
        SUM(CASE WHEN t1.user_id IS NULL THEN 1 ELSE 0 END) AS num_users,
        'groupB' AS group_name1,
        group_name2
      FROM table2
      LEFT JOIN (SELECT user_id FROM table1 WHERE group_name1 = 'groupB') t1
      USING(user_id)
      GROUP BY group_name2
    )
    -- ORDER BY group_name1, group_name2, times_show_up
Answer
Below is a version for BigQuery Standard SQL, which ended up being relatively simple:
    #standardSQL
    SELECT
      times_show_up,
      COUNT(DISTINCT user_id) number_of_users,
      group_name1,
      group_name2
    FROM (
      SELECT
        COUNTIF(a.user_id = b.user_id) times_show_up,
        b.user_id,
        group_name1,
        group_name2
      FROM table1 a
      CROSS JOIN table2 b
      GROUP BY user_id, group_name1, group_name2
    )
    GROUP BY times_show_up, group_name1, group_name2
    -- ORDER BY group_name2, group_name1, times_show_up
Applied to the sample data from your question, the result is:
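| Row | times_show_up | number_of_users | group_name1 | group_name2 |
---------------------------------------------------------------------
| 1   | 0             | 1               | groupA      | groupL      |
| 2   | 1             | 1               | groupA      | groupL      |
| 3   | 2             | 1               | groupA      | groupL      |
| 4   | 0             | 2               | groupB      | groupL      |
| 5   | 1             | 1               | groupB      | groupL      |
| 6   | 0             | 2               | groupA      | groupN      |
| 7   | 2             | 1               | groupA      | groupN      |
| 8   | 0             | 2               | groupB      | groupN      |
| 9   | 1             | 1               | groupB      | groupN      |
---------------------------------------------------------------------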
“… rows with 0s in the number_of_users column don’t have to show up”
Note: I followed this rule, so the result omits those rows; it looks like you plan to filter them out anyway if they appear.
Update for “… each table has 10s of millions of rows.”
Try the “optimized” version below, which aggregates table1 and de-duplicates table2 before the CROSS JOIN:
    #standardSQL
    SELECT
      times_show_up,
      COUNT(DISTINCT user_id) number_of_users,
      group_name1,
      group_name2
    FROM (
      SELECT
        SUM(IF(a.user_id = b.user_id, cnt, 0)) times_show_up,
        b.user_id,
        group_name1,
        group_name2
      FROM (
        SELECT user_id, group_name1, COUNT(1) cnt
        FROM table1
        GROUP BY user_id, group_name1
      ) a
      CROSS JOIN (SELECT DISTINCT user_id, group_name2 FROM table2) b
      GROUP BY user_id, group_name1, group_name2
    )
    GROUP BY times_show_up, group_name1, group_name2
I don’t have comparable data to test with, though, so I can’t say whether this will actually help for your specific data.
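If the CROSS JOIN between the two user-level subqueries is still too large, one variation that might be worth trying is to cross join the distinct table2 memberships only with the distinct group_name1 values (20+ rows in the question), and then LEFT JOIN the pre-aggregated table1 counts. The sketch below is an assumption on my part, not something tested on BigQuery or at scale: it runs the idea against the sample data in SQLite through Python's sqlite3 module, and the CTE names (`counts`, `members`, `groups1`) are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (user_id INTEGER, group_name1 TEXT);
CREATE TABLE table2 (user_id INTEGER, group_name2 TEXT);
INSERT INTO table1 VALUES (1,'groupA'),(1,'groupB'),(2,'groupA'),(1,'groupA');
INSERT INTO table2 VALUES (1,'groupL'),(2,'groupL'),(3,'groupL'),
                          (4,'groupN'),(1,'groupN'),(3,'groupN');
""")

result = conn.execute("""
WITH counts AS (        -- one row per (user_id, group_name1) in table1
  SELECT user_id, group_name1, COUNT(*) AS cnt
  FROM table1
  GROUP BY user_id, group_name1
),
members AS (            -- distinct memberships in table2
  SELECT DISTINCT user_id, group_name2 FROM table2
),
groups1 AS (            -- small list of table1 groups
  SELECT DISTINCT group_name1 FROM table1
)
SELECT COALESCE(c.cnt, 0) AS times_show_up,  -- no match means 0 appearances
       COUNT(*) AS number_of_users,
       g.group_name1,
       m.group_name2
FROM members AS m
CROSS JOIN groups1 AS g
LEFT JOIN counts AS c
  ON c.user_id = m.user_id AND c.group_name1 = g.group_name1
GROUP BY COALESCE(c.cnt, 0), g.group_name1, m.group_name2
ORDER BY m.group_name2, g.group_name1, times_show_up
""").fetchall()

for row in result:
    print(row)
```

The intermediate grid here has one row per table2 membership per group_name1, instead of one row per pair of users, and each grid row matches at most one row of `counts`, so a plain COUNT(*) gives the number of users. Whether that is small enough for the real tables would need to be checked against the actual data.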