I have 2 tables at different granularity in a BigQuery dataset. I need to join the two and roll them up using BigQuery SQL in such a way that the values in one of the columns of the 2nd table become the columns in the final table.
Table 1 – tb1 looks like this
user_id  event_date
A        2019-02-01
B        2019-02-10
C        2019-01-15
Table 2 – tb2 looks like this
user_id  activity_id  activity_date
A        1            2019-01-01
A        1            2019-02-05
A        2            2019-01-15
B        2            2019-02-02
B        3            2019-02-01
C        1            2019-01-02
I am trying to write a SQL query to create the final table, which tells us the number of records for each activity for a user_id where the activity_date is within (event_date – 90 days) for that user_id, i.e. the activity_date is in the 90 days leading up to the event_date. So, in this case, the output will look like this:
user_id  event_date  act_1  act_2  act_3
A        2019-02-01  1      1      0
B        2019-02-10  0      2      1
C        2019-01-15  1      0      0
act_1 column corresponds to activity_id=1 and so on.
There are some additional complications:
- The number of distinct activity_ids in Table 2 can change over time, so I don't know beforehand how many columns will be created in the output table.
- I cannot do this in Python; it has to be done in BQ. This is because the actual Table 2 is very large (42 TB with 31 bn rows), and pulling it out of BQ into another GCP product to run Python would be cumbersome.
Any help is appreciated.
Answer
Below is for BigQuery Standard SQL and is just to demonstrate the approach of pivoting the data.
If you knew in advance how many distinct activity_id values there are, and if that number were low (for example three, as in your sample data), you could do something as simple as this:
#standardSQL
SELECT
  user_id,
  event_date,
  COUNTIF(activity_id = 1) act_1,
  COUNTIF(activity_id = 2) act_2,
  COUNTIF(activity_id = 3) act_3
FROM `project.dataset.table1` t1
JOIN `project.dataset.table2` t2
USING(user_id)
GROUP BY user_id, event_date
ORDER BY user_id, event_date
If applied to the sample data in your question, the result will be:
Row  user_id  event_date  act_1  act_2  act_3
1    A        2019-02-01  2      1      0
2    B        2019-02-10  0      1      1
3    C        2019-01-15  1      0      0
But as you mentioned
The number of distinct activity_ids in Table 2 can change over time. So, I don’t know before hand how many columns will be created in the output table
So you need to generate the above query dynamically; below is an example of how to do that.
#standardSQL
WITH activities AS (
  SELECT DISTINCT activity_id
  FROM `project.dataset.table2`
), generate_query AS (
  SELECT CONCAT(
    'SELECT user_id, event_date',
    STRING_AGG(CONCAT(',COUNTIF(activity_id = ', CAST(activity_id AS STRING), ') act_', CAST(activity_id AS STRING)), ''),
    ' FROM `project.dataset.table1` t1 JOIN `project.dataset.table2` t2 USING(user_id) GROUP BY user_id, event_date ORDER BY user_id, event_date'
  ) AS query
  FROM activities
)
SELECT query FROM generate_query
Again, applied to your sample data, the result will be:
SELECT user_id, event_date,COUNTIF(activity_id = 1) act_1,COUNTIF(activity_id = 2) act_2,COUNTIF(activity_id = 3) act_3 FROM `project.dataset.table1` t1 JOIN `project.dataset.table2` t2 USING(user_id) GROUP BY user_id, event_date ORDER BY user_id, event_date
If you look closer at the above result, you can see that it is exactly the query we initially wrote manually, only now it was generated for us; and no matter how many distinct activity_id values you have (obviously the limit on the number of columns still applies), it will produce the needed query.
So now you just need to copy the query text from the above result and simply run it, which will produce the desired result:
Row  user_id  event_date  act_1  act_2  act_3
1    A        2019-02-01  2      1      0
2    B        2019-02-10  0      1      1
3    C        2019-01-15  1      0      0
As you can see, this is a two-step process, but you can script it using the client of your choice.
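If you want to avoid the copy-and-run step entirely, BigQuery scripting can do both steps in one pass. The sketch below is not part of the original answer and assumes EXECUTE IMMEDIATE is available in your project; the table names are the same placeholders as above.

#standardSQL
-- Sketch: build the pivot query into a variable, then run it in the same script
DECLARE gen_query STRING;

SET gen_query = (
  SELECT CONCAT(
    'SELECT user_id, event_date',
    STRING_AGG(
      CONCAT(',COUNTIF(activity_id = ', CAST(activity_id AS STRING), ') act_', CAST(activity_id AS STRING)),
      '' ORDER BY activity_id
    ),
    ' FROM `project.dataset.table1` t1 JOIN `project.dataset.table2` t2 USING(user_id)',
    ' GROUP BY user_id, event_date ORDER BY user_id, event_date'
  )
  FROM (SELECT DISTINCT activity_id FROM `project.dataset.table2`)
);

-- Runs the generated SELECT; the act_N columns come out in activity_id order
EXECUTE IMMEDIATE gen_query;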
Note: I focused on the substance of the question and have not addressed the 90-day filter at all; I feel it was a secondary detail in your question.
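That said, if you do want to fold the 90-day window in, one option (just a sketch, assuming event_date and activity_date are DATE columns and that the window means activity rows falling in the 90 days up to and including event_date) is to add a date condition before grouping:

#standardSQL
SELECT
  user_id,
  event_date,
  COUNTIF(activity_id = 1) act_1,
  COUNTIF(activity_id = 2) act_2,
  COUNTIF(activity_id = 3) act_3
FROM `project.dataset.table1` t1
JOIN `project.dataset.table2` t2
USING(user_id)
-- keep only activity rows inside the 90-day window ending at event_date
WHERE t2.activity_date BETWEEN DATE_SUB(t1.event_date, INTERVAL 90 DAY) AND t1.event_date
GROUP BY user_id, event_date
ORDER BY user_id, event_date

The same WHERE clause can be spliced into the dynamically generated query in the same spot. Note that with an inner join plus this filter, users with no qualifying activity rows disappear from the output; a LEFT JOIN with the date condition moved into the ON clause would keep them with zero counts.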