Skip to content
Advertisement

Need help joining 2 tables and rolling them up in a unique way using BigQuery SQL

I have 2 tables at different granularity in a BigQuery dataset. I need to join those two and roll it up using BigQuery SQL in such a way that the value in one of the columns of the 2nd table becomes the columns in the final table

Table 1 – tb1 looks like this

Table 2 – tb2 looks like this

I am trying to write a SQL query to create the final table which tells us the number of records for each activity for a user_id where the activity_date is within (event_date – 90 days) for that user_id i.e the activity_date is in the 90 days leading up to the event_date. So, in this case, the output will look like this

act_1 column corresponds to activity_id=1 and so on.

There are some additional complications like :-

    1. The number of distinct activity_ids in Table 2 can change over time. So, I don’t know before hand how many columns will be created in the output table.
    1. I cannot do this in python but has to be done in BQ. This is because the actual table 2 is very large (42TB with 31bn rows) and pulling it out of BQ into another GCS product to run python might be cumbersome.

Any help is appreciated.

Advertisement

Answer

Below is for BigQuery Standard SQL and just to demonstrate you approach of pivoting data

If you would know in advance how many distinct activity_id you have and if the number of such is low – for example three – as in your example – you would do as simple as below

if to apply to sample data as in your question – result will be

But as you mentioned

The number of distinct activity_ids in Table 2 can change over time. So, I don’t know before hand how many columns will be created in the output table

So, you need to generate above query dynamically – below is an example of such

Again, if to apply to your sample data – result will be

if to look closer to above result – you can see – it is exactly the query that we initially wrote manually – but now it was generated for us – and no matter how many distinct activity_id you have (obviously limitations on columns number is still apply) – it will product needed query

So, now you just need to copy the text of query from above result and just simply run it – which will produce desired result

As you can see this is two-steps process – but you can script it using client of your choice

Note: i focused on substance of the question and have not addressed at all 90 days related stuff – I feel it was secondary detail in your question

User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement