Skip to content
Advertisement

Selecting values from a nested column based on a condition applied to another nested column in BigQuery

How can I use the index of a “special” value from a nested column (ex: the index of the max value in that nested column) to select a value from another nested column using that index?

As an example, consider a table with the following schema:

Field name Type Mode
id STRING NULLABLE
username STRING NULLABLE
▼ products RECORD NULLABLE
     ▼ list RECORD REPEATED
            item STRING NULLABLE
▼ ordered RECORD NULLABLE
     ▼ list RECORD REPEATED
            item INTEGER NULLABLE
total_orders STRING NULLABLE
update_time TIMESTAMP NULLABLE
update_id INTEGER NULLABLE

The first few rows look like:

Row id username products.list.item ordered.list.item total_orders update_time update_id
1 1234 a_turing Apple 1 3 2021-08-14 20:03:22.100846 UTC 121231
      Orange 0      
      Pear 2      
2 5678 g_hopper Apple 0 2 2021-08-15 09:36:48.220464 UTC 121232
      Orange 2      
      Pear 0      
3 1122 a_lovelace Apple 0 1 2021-08-15 13:59:03.441506 UTC 121233
      Orange 1      
      Pear 0      
4 3344 v_nabokov Apple 1 2 2021-08-17 17:34:53.415406 UTC 121234
      Orange 0      
      Pear 1      

I want to select the most ordered product for each id’s most recent order and exclude orders that don’t have a most ordered product (for instance if a customer ordered the same number of Apple, Orange and Pear).

The query I currently use is a chain of CTEs, one for each product type plus an extra column which is the max number of products ordered by each user (max_ordered). I then join together the CTEs using the id column:

This returns the following table:

Row username update_time apples_ordered oranges_ordered pears_ordered max_ordered
1 a_turing 2021-08-14 20:03:22.100846 UTC 1 0 2 2
2 g_hopper 2021-08-15 09:36:48.220464 UTC 0 2 0 2
3 a_lovelace 2021-08-15 13:59:03.441506 UTC 0 1 0 1

which is expected.
However, I can’t figure out how to simply return a table that looks like:

Row username update_time max_product_ordered
1 a_turing 2021-08-14 20:03:22.100846 UTC Pear
2 g_hopper 2021-08-15 09:36:48.220464 UTC Orange
3 a_lovelace 2021-08-15 13:59:03.441506 UTC Orange

I’m also fairly certain that although this query basically works (I end up doing post-processing in Python to get to the last step) it might be extremely inefficient given the extensive use of “common table expressions”.
Is there a more efficient way to query my BigQuery table than what I’ve written or would I need to completely restructure the table to get any speedup? It currently takes ~10s to run this query on a table with ~10,000 rows and 12 columns and I believe the slowness is due to the multiple CTEs.
I’ve been banging my head against the wall for the past two weeks trying to improve my query and haven’t made much headway. Any help is sincerely appreciated!

Advertisement

Answer

Consider below approach

if applied to sample data in your question – output is

enter image description here

User contributions licensed under: CC BY-SA
8 People found this is helpful
Advertisement