Skip to content
Advertisement

Need help joining 2 tables and rolling them up in a unique way using BigQuery SQL

I have 2 tables at different granularity in a BigQuery dataset. I need to join those two and roll it up using BigQuery SQL in such a way that the value in one of the columns of the 2nd table becomes the columns in the final table

Table 1 – tb1 looks like this

user_id     event_date
A          2019-02-01
B          2019-02-10
C          2019-01-15

Table 2 – tb2 looks like this

user_id    activity_id   activity_date
A          1             2019-01-01
A          1             2019-02-05
A          2             2019-01-15
B          2             2019-02-02
B          3             2019-02-01
C          1             2019-01-02

I am trying to write a SQL query to create the final table which tells us the number of records for each activity for a user_id where the activity_date is within (event_date – 90 days) for that user_id i.e the activity_date is in the 90 days leading up to the event_date. So, in this case, the output will look like this

user_id   event_date  act_1   act_2    act_3 
A         2019-02-01   1        1        0
B         2019-02-10   0        2        1
C         2019-01-15   1        0        0

act_1 column corresponds to activity_id=1 and so on.

There are some additional complications like :-

    1. The number of distinct activity_ids in Table 2 can change over time. So, I don’t know before hand how many columns will be created in the output table.
    1. I cannot do this in python but has to be done in BQ. This is because the actual table 2 is very large (42TB with 31bn rows) and pulling it out of BQ into another GCS product to run python might be cumbersome.

Any help is appreciated.

Advertisement

Answer

Below is for BigQuery Standard SQL and just to demonstrate you approach of pivoting data

If you would know in advance how many distinct activity_id you have and if the number of such is low – for example three – as in your example – you would do as simple as below

#standardSQL
SELECT 
  user_id,
  event_date,
  COUNTIF(activity_id = 1) act_1,
  COUNTIF(activity_id = 2) act_2,
  COUNTIF(activity_id = 3) act_3
FROM `project.dataset.table1` t1
JOIN `project.dataset.table2` t2
USING(user_id)
GROUP BY user_id, event_date
ORDER BY user_id, event_date   

if to apply to sample data as in your question – result will be

Row user_id event_date  act_1   act_2   act_3    
1   A       2019-02-01  2       1       0    
2   B       2019-02-10  0       1       1    
3   C       2019-01-15  1       0       0      

But as you mentioned

The number of distinct activity_ids in Table 2 can change over time. So, I don’t know before hand how many columns will be created in the output table

So, you need to generate above query dynamically – below is an example of such

#standardSQL
WITH activities AS (
  SELECT DISTINCT activity_id 
  FROM `project.dataset.table2`
), generate_query AS (
  SELECT CONCAT(
    'SELECT user_id, event_date',
    STRING_AGG(CONCAT(',COUNTIF(activity_id = ', CAST(activity_id AS STRING), ') act_', CAST(activity_id AS STRING)), ''),
    ' FROM `project.dataset.table1` t1 JOIN `project.dataset.table2` t2 USING(user_id) GROUP BY user_id, event_date ORDER BY user_id, event_date'
  ) AS query
  FROM activities
)
SELECT query FROM generate_query 

Again, if to apply to your sample data – result will be

SELECT user_id, event_date,COUNTIF(activity_id = 1) act_1,COUNTIF(activity_id = 2) act_2,COUNTIF(activity_id = 3) act_3 FROM `project.dataset.table1` t1 JOIN `project.dataset.table2` t2 USING(user_id) GROUP BY user_id, event_date ORDER BY user_id, event_date   

if to look closer to above result – you can see – it is exactly the query that we initially wrote manually – but now it was generated for us – and no matter how many distinct activity_id you have (obviously limitations on columns number is still apply) – it will product needed query

So, now you just need to copy the text of query from above result and just simply run it – which will produce desired result

Row user_id event_date  act_1   act_2   act_3    
1   A       2019-02-01  2       1       0    
2   B       2019-02-10  0       1       1    
3   C       2019-01-15  1       0       0      

As you can see this is two-steps process – but you can script it using client of your choice

Note: i focused on substance of the question and have not addressed at all 90 days related stuff – I feel it was secondary detail in your question

User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement