Count rows that don’t match between two tables within a group

I have two tables, each with a user_id column and a group name column (group_name1 in table1, group_name2 in table2).

E.g.

table1:

| user_id | group_name1|
------------------------
|    1    |   'groupA' |
|    1    |   'groupB' |
|    2    |   'groupA' |
|    1    |   'groupA' |
------------------------


table2:

| user_id | group_name2|
------------------------
|    1    |   'groupL' |
|    2    |   'groupL' |
|    3    |   'groupL' |
|    4    |   'groupN' |
|    1    |   'groupN' |
|    3    |   'groupN' |
------------------------

I’m trying to build a distribution of how many times each user in table2 shows up in table1, computed separately for every pair of groups (one group from each table).

For the example above, I’d get

| times_show_up | number_of_users | group_name1 | group_name2 |
---------------------------------------------------------------
|      0        |       1         |    groupA   |    groupL   |
|      1        |       1         |    groupA   |    groupL   |
|      2        |       1         |    groupA   |    groupL   |
|      0        |       2         |    groupB   |    groupL   |
|      1        |       1         |    groupB   |    groupL   |
|      2        |       0         |    groupB   |    groupL   |
|      0        |       2         |    groupA   |    groupN   |
|      1        |       0         |    groupA   |    groupN   |
|      2        |       1         |    groupA   |    groupN   |
|      0        |       2         |    groupB   |    groupN   |
|      1        |       1         |    groupB   |    groupN   |
|      2        |       0         |    groupB   |    groupN   |
---------------------------------------------------------------

To explain a few rows:

For row 1, the user with user_id = 3 in groupL shows up 0 times in groupA. For row 2, the user with user_id = 2 in groupL shows up once in groupA. And for row 3, the user with user_id = 1 in groupL shows up twice in groupA.

The other group combinations follow the same logic, assuming I filled the table out correctly.

Although a user shows up at most 2 times in this example, in the real data the maximum is arbitrarily large and I don’t know it ahead of time.

I’ve come up with a query that does all of this except count the 0s:

    SELECT 
        COUNT(user_id) AS num_users,
        times_show_up,
        group_name1,
        group_name2
    FROM
    (
        SELECT 
            user_id, 
            COUNT(*) AS times_show_up,
            group_name1, 
            group_name2
        FROM
            table1
        RIGHT JOIN
            (SELECT DISTINCT user_id, group_name2 FROM table2)
        USING(user_id)
        GROUP BY user_id, group_name1, group_name2
    )
    GROUP BY times_show_up, group_name1, group_name2

Unfortunately this won’t return the 0 counts in the times_show_up column, and I haven’t found a solution that avoids a lot of subqueries. One option is to run a subquery that grabs the 0s for each combination of groups and then UNION those rows onto the rest of the result. But I want to avoid one subquery per (group_name1, group_name2) combination, since the number of groups is very large.
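To make the missing zeros concrete, here is a minimal diagnostic sketch in BigQuery Standard SQL. It inlines the sample rows from the question as CTEs and uses the hypothetical column aliases count_star and count_matches, which are not part of the original query. It only illustrates the RIGHT JOIN behaviour and is not a fix.

```sql
#standardSQL
-- Sample rows from the question, inlined so the query is self-contained.
WITH table1 AS (
  SELECT 1 AS user_id, 'groupA' AS group_name1 UNION ALL
  SELECT 1, 'groupB' UNION ALL
  SELECT 2, 'groupA' UNION ALL
  SELECT 1, 'groupA'
),
table2 AS (
  SELECT 1 AS user_id, 'groupL' AS group_name2 UNION ALL
  SELECT 2, 'groupL' UNION ALL
  SELECT 3, 'groupL' UNION ALL
  SELECT 4, 'groupN' UNION ALL
  SELECT 1, 'groupN' UNION ALL
  SELECT 3, 'groupN'
)
SELECT
  t2.user_id,
  t2.group_name2,
  t1.group_name1,                     -- NULL when the user is absent from table1
  COUNT(*) AS count_star,             -- also counts the NULL-extended row, so it is never 0
  COUNT(t1.user_id) AS count_matches  -- counts only real table1 rows, so it can be 0
FROM table1 AS t1
RIGHT JOIN (SELECT DISTINCT user_id, group_name2 FROM table2) AS t2
  ON t1.user_id = t2.user_id
GROUP BY t2.user_id, t2.group_name2, t1.group_name1
```

Two things show up here. A user who is missing from table1 entirely (user_id = 3) comes back with group_name1 = NULL, so even a count of 0 cannot be attributed to groupA or groupB. And a user who appears in only one group (user_id = 2, only in groupA) produces no row at all for the other group, because the join never generates a (user 2, groupB) pair.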

Some limitations: PARTITION BY on this data set tends to run out of memory, so I want to avoid it. Updated requirement: a CROSS JOIN at the individual user level (cross joining table1 directly to table2 without aggregating rows first) doesn’t work either, because each table has tens of millions of rows.

Finally, rows with 0 in the number_of_users column don’t have to show up. It’s fine if they do, since a simple WHERE can remove them, but they aren’t necessary if leaving them out simplifies the query.


UPDATE:

I was able to come up with a query that generates the zeros with one subquery per group_name1, rather than one per (group_name1, group_name2) combination. I’m adding it to the question in case it helps someone find an answer that needs fewer queries still: table1 can have 20 or more groups, which means 20 or more subqueries combined via UNION ALL.

SELECT * FROM
    (SELECT 
        times_show_up,
        COUNT(user_id) AS num_users,
        group_name1,
        group_name2
    FROM
    (
        SELECT 
            user_id, 
            COUNT(*) AS times_show_up,
            group_name1, 
            group_name2
        FROM
            table1
        INNER JOIN
            (SELECT DISTINCT user_id, group_name2 FROM table2) t2
        USING(user_id)
        GROUP BY user_id, group_name1, group_name2
    ) t1
    GROUP BY times_show_up, group_name1, group_name2) t9
    UNION ALL
    (SELECT
       0 AS times_show_up,
       SUM(CASE WHEN t1.user_id IS NULL 
           THEN 1 ELSE 0 END) AS num_users,
       'groupA' AS group_name1,
       group_name2
     FROM
       table2
     LEFT JOIN
       (SELECT user_id FROM table1 WHERE group_name1 = 'groupA') t1
     USING(user_id)
     GROUP BY group_name2)
     UNION ALL
    (SELECT
       0 AS times_show_up,
       SUM(CASE WHEN t1.user_id IS NULL 
           THEN 1 ELSE 0 END) AS num_users,
       'groupB' AS group_name1,
       group_name2
     FROM
       table2
     LEFT JOIN
       (SELECT user_id FROM table1 WHERE group_name1 = 'groupB') t1
     USING(user_id)
     GROUP BY group_name2)
     --- ORDER BY group_name1, group_name2, times_show_up

Answer

The query below is for BigQuery Standard SQL and turned out to be relatively simple:

#standardSQL
SELECT times_show_up, 
  COUNT(DISTINCT user_id) number_of_users, 
  group_name1, group_name2
FROM (
  SELECT COUNTIF(a.user_id = b.user_id) times_show_up, 
    b.user_id, 
    group_name1, group_name2
  FROM table1 a
  CROSS JOIN table2 b
  GROUP BY user_id, group_name1, group_name2
)
GROUP BY times_show_up, group_name1, group_name2
-- ORDER BY group_name2, group_name1, times_show_up   

Applied to the sample data from your question, the result is:

Row times_show_up   number_of_users group_name1 group_name2  
1   0               1               groupA      groupL   
2   1               1               groupA      groupL   
3   2               1               groupA      groupL   
4   0               2               groupB      groupL   
5   1               1               groupB      groupL   
6   0               2               groupA      groupN   
7   2               1               groupA      groupN   
8   0               2               groupB      groupN   
9   1               1               groupB      groupN     

… rows with 0s in the number_of_users column don’t have to show up

Note: I followed this rule and left those rows out, since it looks like you plan to eliminate them anyway.

Update for … each table has 10s of millions of rows.

Try the “optimized” version below, which aggregates each table before the CROSS JOIN:

#standardSQL
SELECT times_show_up, 
  COUNT(DISTINCT user_id) number_of_users, 
  group_name1, group_name2
FROM (
  SELECT SUM(IF(a.user_id = b.user_id, cnt, 0)) times_show_up, 
    b.user_id, 
    group_name1, group_name2
  FROM (SELECT user_id, group_name1, COUNT(1) cnt FROM table1 GROUP BY user_id, group_name1) a
  CROSS JOIN (SELECT DISTINCT user_id, group_name2 FROM table2) b
  GROUP BY user_id, group_name1, group_name2
)
GROUP BY times_show_up, group_name1, group_name2

I don’t have relevant data to test whether this will help with your specific data, though.
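If the CROSS JOIN is still too large after aggregation, one alternative is to join on user_id for the non-zero counts and derive the zeros by subtraction, so that only the short lists of distinct groups are cross joined. The sketch below is an assumption-laden illustration, not part of the answer above: the CTE names (t1_counts, t2_users, nonzero, matched, group2_totals, zeros) are hypothetical, the sample rows from the question are inlined so it runs on its own, and it has not been tried on large tables.

```sql
#standardSQL
-- Sample rows from the question; replace these two CTEs with the real tables.
WITH table1 AS (
  SELECT 1 AS user_id, 'groupA' AS group_name1 UNION ALL
  SELECT 1, 'groupB' UNION ALL
  SELECT 2, 'groupA' UNION ALL
  SELECT 1, 'groupA'
),
table2 AS (
  SELECT 1 AS user_id, 'groupL' AS group_name2 UNION ALL
  SELECT 2, 'groupL' UNION ALL
  SELECT 3, 'groupL' UNION ALL
  SELECT 4, 'groupN' UNION ALL
  SELECT 1, 'groupN' UNION ALL
  SELECT 3, 'groupN'
),
-- One row per (user, group_name1) with the number of occurrences.
t1_counts AS (
  SELECT user_id, group_name1, COUNT(*) AS times_show_up
  FROM table1
  GROUP BY user_id, group_name1
),
-- One row per (user, group_name2).
t2_users AS (
  SELECT DISTINCT user_id, group_name2 FROM table2
),
-- Non-zero buckets: an equality join on user_id, no cross join of users.
nonzero AS (
  SELECT c.times_show_up, COUNT(*) AS number_of_users, c.group_name1, u.group_name2
  FROM t1_counts AS c
  JOIN t2_users AS u ON c.user_id = u.user_id
  GROUP BY c.times_show_up, c.group_name1, u.group_name2
),
-- Users of each group_name2 that appear at least once in each group_name1.
matched AS (
  SELECT group_name1, group_name2, SUM(number_of_users) AS matched_users
  FROM nonzero
  GROUP BY group_name1, group_name2
),
-- Total distinct users per group_name2.
group2_totals AS (
  SELECT group_name2, COUNT(*) AS total_users
  FROM t2_users
  GROUP BY group_name2
),
-- Zero bucket = total users minus matched users; only group lists are cross joined.
zeros AS (
  SELECT
    0 AS times_show_up,
    g2.total_users - IFNULL(m.matched_users, 0) AS number_of_users,
    g1.group_name1,
    g2.group_name2
  FROM (SELECT DISTINCT group_name1 FROM table1) AS g1
  CROSS JOIN group2_totals AS g2
  LEFT JOIN matched AS m
    ON m.group_name1 = g1.group_name1 AND m.group_name2 = g2.group_name2
)
SELECT times_show_up, number_of_users, group_name1, group_name2 FROM nonzero
UNION ALL
SELECT times_show_up, number_of_users, group_name1, group_name2 FROM zeros
WHERE number_of_users > 0
```

On the sample data this should produce the same nine rows as the result shown earlier in this answer. The subtraction is valid because t1_counts and t2_users each hold at most one row per user within a group, so every matched user is counted exactly once per (group_name1, group_name2) pair.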

User contributions licensed under: CC BY-SA