
Count rows that don’t match between two tables within a group

I have two tables, each with user_id and group_name columns.

E.g.
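(The original example tables were not preserved in this copy of the question; the following is a minimal hypothetical reconstruction, consistent with the rows described further down, showing only groupA in table1 and groupL in table2.)

table1:

user_id | group_name
1       | groupA
1       | groupA
2       | groupA

table2:

user_id | group_name
1       | groupL
2       | groupL
3       | groupL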

I’m trying to create a distribution of counts of the number of times a user in table2 shows up in table1, but broken down by group.

For the example above, I’d get:
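(Hypothetical as well, derived from the reconstructed sample above rather than copied from the original post:)

group_name1 | group_name2 | times_show_up | number_of_users
groupA      | groupL      | 0             | 1
groupA      | groupL      | 1             | 1
groupA      | groupL      | 2             | 1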

To explain some of the rows as a further example:

For row 1, the user with user_id = 3 in groupL shows up 0 times in groupA. For row 2, the user with user_id = 2 in groupL shows up once in groupA. And for row 3, the user with user_id = 1 in groupL shows up twice in groupA.

Although in this example a person shows up at most twice, in the real data this maximum is some arbitrarily large number that I don’t know ahead of time.

Similarly for the other groups if I filled all that out correctly.

I’ve come up with a query that can do all of this except counting the 0s; it looks like this:
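(The query itself is missing from this copy; the sketch below shows what such a query might look like, with the column names group_name1, group_name2, times_show_up and number_of_users assumed from the rest of the question.)

SELECT
  group_name1,
  group_name2,
  times_show_up,
  COUNT(*) AS number_of_users
FROM (
  -- how many times each table2 user shows up in each table1 group they match
  SELECT
    t1.group_name AS group_name1,
    t2.group_name AS group_name2,
    t2.user_id,
    COUNT(*) AS times_show_up
  FROM table2 t2
  JOIN table1 t1
    ON t1.user_id = t2.user_id
  GROUP BY group_name1, group_name2, t2.user_id
) per_user
GROUP BY group_name1, group_name2, times_show_up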

Unfortunately this won’t return the 0 counts in the times_show_up column, and I haven’t figured out a solution that can accomplish this without a lot of subqueries. One possible method is to just run subqueries to grab all the 0s for all combinations of all groups, and then just UNION those rows to the rest of the table. But I want to avoid approaches that include subqueries for each possible group1, group2 combination since the number of groups is very large.

Some limitations: PARTITION BY on this data set tends to run out of memory, so I want to avoid it. Updated requirement: in addition, using CROSS JOIN at the individual user level (so cross joining table1 directly to table2 without grouping up rows first) doesn’t work, because each table has tens of millions of rows.

Finally, rows with 0s in the number_of_users column don’t have to show up (it’s fine if they do, since they can be removed with a simple WHERE, but they aren’t necessary, if leaving them out helps with the query).


UPDATE:

I was able to come up with a query that can generate the zeros while only requiring a single query for each group_name1, rather than one for each group_name1, group_name2 combination. I’m adding it to the question in case it helps produce answers with even fewer queries, since it’s still the case that the number of groups in table1 could be upwards of 20+, which means 20+ queries combined via UNION ALL.
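(That query is also missing here; a sketch of this kind of per-group approach, one branch per group_name in table1 repeated with a different literal and combined via UNION ALL, might look like the following, with names assumed as above.)

SELECT
  'groupA' AS group_name1,
  per_user.group_name AS group_name2,
  per_user.times_show_up,
  COUNT(*) AS number_of_users
FROM (
  SELECT
    t2.user_id,
    t2.group_name,
    COUNT(t1.user_id) AS times_show_up  -- 0 when the user never appears in groupA
  FROM table2 t2
  LEFT JOIN table1 t1
    ON t1.user_id = t2.user_id
   AND t1.group_name = 'groupA'
  GROUP BY t2.user_id, t2.group_name
) per_user
GROUP BY group_name2, times_show_up
-- UNION ALL
-- ... the same query again for each remaining group_name in table1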


Answer

Below is for BigQuery Standard SQL, and it ended up being relatively simple.
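(The answer’s actual query was not preserved in this copy. As a hedged sketch of one relatively simple way to get the zero counts without a user-level cross join, the query below cross joins table2 only against the distinct group names in table1 and then LEFT JOINs the per-user counts; the column names are assumed from the question, and the original answer may have been written differently.)

SELECT
  groups1.group_name AS group_name1,
  t2.group_name AS group_name2,
  IFNULL(c.times, 0) AS times_show_up,
  COUNT(*) AS number_of_users
FROM table2 t2
CROSS JOIN (SELECT DISTINCT group_name FROM table1) groups1
LEFT JOIN (
  -- how many times each user appears in each table1 group
  SELECT group_name, user_id, COUNT(*) AS times
  FROM table1
  GROUP BY group_name, user_id
) c
  ON c.user_id = t2.user_id
 AND c.group_name = groups1.group_name
GROUP BY group_name1, group_name2, times_show_up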

If applied to the sample data from your question, the result is:

… rows with 0s in the number_of_users column don’t have to show up

Note: I follow this rule since it looks like you plan to eliminate such rows anyway if they show up in the result.

Update for … each table has 10s of millions of rows.

Try the “optimized” version below:
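(This query is missing here as well. One way an “optimized” variant could be structured is to pre-aggregate both tables and compute the zero bucket arithmetically instead of materializing a row per user per table1 group; the sketch below does that, but the approach in the original answer may have differed.)

WITH t1_counts AS (
  -- one row per (group, user) in table1, with how many times the user appears
  SELECT group_name, user_id, COUNT(*) AS times
  FROM table1
  GROUP BY group_name, user_id
), t2_users AS (
  SELECT DISTINCT group_name, user_id
  FROM table2
), nonzero AS (
  -- distribution of counts for users that appear at least once
  SELECT
    c.group_name AS group_name1,
    u.group_name AS group_name2,
    c.times AS times_show_up,
    COUNT(*) AS number_of_users
  FROM t2_users u
  JOIN t1_counts c
    ON c.user_id = u.user_id
  GROUP BY group_name1, group_name2, times_show_up
), zeros AS (
  -- users of group_name2 that never appear in group_name1:
  -- total users in group_name2 minus those already counted above
  SELECT
    g1.group_name AS group_name1,
    totals.group_name AS group_name2,
    0 AS times_show_up,
    totals.total_users - IFNULL(SUM(nz.number_of_users), 0) AS number_of_users
  FROM (SELECT group_name, COUNT(*) AS total_users FROM t2_users GROUP BY group_name) totals
  CROSS JOIN (SELECT DISTINCT group_name FROM table1) g1
  LEFT JOIN nonzero nz
    ON nz.group_name1 = g1.group_name
   AND nz.group_name2 = totals.group_name
  GROUP BY group_name1, group_name2, totals.total_users
)
SELECT * FROM nonzero
UNION ALL
SELECT * FROM zeros

The zeros branch can emit rows with number_of_users = 0; per the quoted rule above, those are fine to keep or to drop with a simple WHERE.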

I don’t have relevant data to test with, though, so I can’t say whether this will help for your specific data.
