Selecting values from a nested column based on a condition applied to another nested column in BigQuery

How can I use the index of a “special” value from a nested column (ex: the index of the max value in that nested column) to select a value from another nested column using that index?

As an example, consider a table with the following schema:

Field name	Type	Mode
id	STRING	NULLABLE
username	STRING	NULLABLE
▼ products	RECORD	NULLABLE
▼ list	RECORD	REPEATED
item	STRING	NULLABLE
▼ ordered	RECORD	NULLABLE
▼ list	RECORD	REPEATED
item	INTEGER	NULLABLE
total_orders	STRING	NULLABLE
update_time	TIMESTAMP	NULLABLE
update_id	INTEGER	NULLABLE

The first few rows look like:

Row	id	username	products.list.item	ordered.list.item	total_orders	update_time	update_id
1	1234	a_turing	Apple	1	3	2021-08-14 20:03:22.100846 UTC	121231
			Orange	0
			Pear	2
2	5678	g_hopper	Apple	0	2	2021-08-15 09:36:48.220464 UTC	121232
			Orange	2
			Pear	0
3	1122	a_lovelace	Apple	0	1	2021-08-15 13:59:03.441506 UTC	121233
			Orange	1
			Pear	0
4	3344	v_nabokov	Apple	1	2	2021-08-17 17:34:53.415406 UTC	121234
			Orange	0
			Pear	1

I want to select the most ordered product for each id’s most recent order and exclude orders that don’t have a most ordered product (for instance if a customer ordered the same number of Apple, Orange and Pear).

The query I currently use is a chain of CTEs, one for each product type plus an extra column which is the max number of products ordered by each user (max_ordered). I then join together the CTEs using the id column:

WITH RANKED_ORDERS AS( 
SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY update_time DESC) AS rn
FROM mycompany.engagement.products_ordered),

LATEST_ORDERS AS(
SELECT * FROM RANKED_ORDERS WHERE rn = 1),

-- ---------------------- Apples Ordered -----------------------
APPLES_INDEXED AS(
SELECT id, username, ordered, flattened_products, offset_nk, update_time, rn
FROM LATEST_ORDERS
CROSS JOIN UNNEST(LATEST_ORDERS.products.list) AS flattened_products
WITH OFFSET as offset_nk
WHERE flattened_products.item in ('Apple')
ORDER BY offset_nk),

APPLES_ORDERED AS(
SELECT id, username, update_time, ordered.list[OFFSET(offset_nk)].item as apples_ordered 
FROM APPLES_INDEXED 
ORDER BY
update_time ASC),

-- ---------------------- Oranges Ordered ----------------------
ORANGES_INDEXED AS(
SELECT id, username, ordered, flattened_products, offset_nk, update_time, rn
FROM LATEST_ORDERS
CROSS JOIN UNNEST(LATEST_ORDERS.products.list) AS flattened_products
WITH OFFSET as offset_nk
WHERE flattened_products.item in ('Orange')
ORDER BY offset_nk),

ORANGES_ORDERED AS(
SELECT id, username, update_time, ordered.list[OFFSET(offset_nk)].item as oranges_ordered 
FROM ORANGES_INDEXED 
ORDER BY
update_time ASC),

-- ---------------------- Pears Ordered -----------------------
PEARS_INDEXED AS(
SELECT id, username, ordered, flattened_products, offset_nk, update_time, rn
FROM LATEST_ORDERS
CROSS JOIN UNNEST(LATEST_ORDERS.products.list) AS flattened_products
WITH OFFSET as offset_nk
WHERE flattened_products.item in ('Pear')
ORDER BY offset_nk),

PEARS_ORDERED AS(
SELECT id, username, update_time, ordered.list[OFFSET(offset_nk)].item as pears_ordered 
FROM PEARS_INDEXED 
ORDER BY
update_time ASC),

-- --------------- Max Product Ordered per Order --------------
MAX_ORDERED AS(
SELECT
id, username, MAX(orders_per_username.item) as max_ordered, total_orders
FROM
LATEST_ORDERS, UNNEST(ordered.list) as orders_per_username
GROUP BY id, username, total_orders),

-- -------------------- Orders In Columns ---------------------
ORDERS_IN_COLUMNS AS(
SELECT APPLES_ORDERED.username, APPLES_ORDERED.update_time, APPLES_ORDERED.apples_ordered,
ORANGES_ORDERED.oranges_ordered, PEARS_ORDERED.pears_ordered, MAX_ORDERED.max_ordered
FROM APPLES_ORDERED
LEFT JOIN ORANGES_ORDERED ON ORANGES_ORDERED.id = APPLES_ORDERED.id
LEFT JOIN PEARS_ORDERED ON PEARS_ORDERED.id = APPLES_ORDERED.id
LEFT JOIN MAX_ORDERED ON MAX_ORDERED.id = APPLES_ORDERED.id),

-- ------- Orders with a most ordered product -----------------
NO_CONFLICTS AS(
SELECT * FROM ORDERS_IN_COLUMNS
WHERE
max_ordered > 0 AND
(
    (apples_ordered not in (oranges_ordered, pears_ordered) AND apples_ordered = max_ordered)
OR
    (oranges_ordered not in (apples_ordered, pears_ordered) AND oranges_ordered = max_ordered)
OR
    (pears_ordered not in (apples_ordered, oranges_ordered) AND pears_ordered = max_ordered)
)
)

SELECT * FROM NO_CONFLICTS

​x
 
WITH RANKED_ORDERS AS( SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY update_time DESC) AS rnFROM mycompany.engagement.products_ordered),​LATEST_ORDERS AS(SELECT * FROM RANKED_ORDERS WHERE rn = 1),​-- ---------------------- Apples Ordered -----------------------APPLES_INDEXED AS(SELECT id, username, ordered, flattened_products, offset_nk, update_time, rnFROM LATEST_ORDERSCROSS JOIN UNNEST(LATEST_ORDERS.products.list) AS flattened_productsWITH OFFSET as offset_nkWHERE flattened_products.item in ('Apple')ORDER BY offset_nk),​APPLES_ORDERED AS(SELECT id, username, update_time, ordered.list[OFFSET(offset_nk)].item as apples_ordered FROM APPLES_INDEXED ORDER BYupdate_time ASC),​-- ---------------------- Oranges Ordered ----------------------ORANGES_INDEXED AS(SELECT id, username, ordered, flattened_products, offset_nk, update_time, rnFROM LATEST_ORDERSCROSS JOIN UNNEST(LATEST_ORDERS.products.list) AS flattened_productsWITH OFFSET as offset_nkWHERE flattened_products.item in ('Orange')ORDER BY offset_nk),​ORANGES_ORDERED AS(SELECT id, username, update_time, ordered.list[OFFSET(offset_nk)].item as oranges_ordered FROM ORANGES_INDEXED ORDER BYupdate_time ASC),​-- ---------------------- Pears Ordered -----------------------PEARS_INDEXED AS(SELECT id, username, ordered, flattened_products, offset_nk, update_time, rnFROM LATEST_ORDERSCROSS JOIN UNNEST(LATEST_ORDERS.products.list) AS flattened_productsWITH OFFSET as offset_nkWHERE flattened_products.item in ('Pear')ORDER BY offset_nk),​PEARS_ORDERED AS(SELECT id, username, update_time, ordered.list[OFFSET(offset_nk)].item as pears_ordered FROM PEARS_INDEXED ORDER BYupdate_time ASC),​-- --------------- Max Product Ordered per Order --------------MAX_ORDERED AS(SELECTid, username, MAX(orders_per_username.item) as max_ordered, total_ordersFROMLATEST_ORDERS, UNNEST(ordered.list) as orders_per_usernameGROUP BY id, username, total_orders),​-- -------------------- Orders In Columns ---------------------ORDERS_IN_COLUMNS AS(SELECT APPLES_ORDERED.username, APPLES_ORDERED.update_time, APPLES_ORDERED.apples_ordered,ORANGES_ORDERED.oranges_ordered, PEARS_ORDERED.pears_ordered, MAX_ORDERED.max_orderedFROM APPLES_ORDEREDLEFT JOIN ORANGES_ORDERED ON ORANGES_ORDERED.id = APPLES_ORDERED.idLEFT JOIN PEARS_ORDERED ON PEARS_ORDERED.id = APPLES_ORDERED.idLEFT JOIN MAX_ORDERED ON MAX_ORDERED.id = APPLES_ORDERED.id),​-- ------- Orders with a most ordered product -----------------NO_CONFLICTS AS(SELECT * FROM ORDERS_IN_COLUMNSWHEREmax_ordered > 0 AND(    (apples_ordered not in (oranges_ordered, pears_ordered) AND apples_ordered = max_ordered)OR    (oranges_ordered not in (apples_ordered, pears_ordered) AND oranges_ordered = max_ordered)OR    (pears_ordered not in (apples_ordered, oranges_ordered) AND pears_ordered = max_ordered)))​SELECT * FROM NO_CONFLICTS​

This returns the following table:

Row	username	update_time	apples_ordered	oranges_ordered	pears_ordered	max_ordered
1	a_turing	2021-08-14 20:03:22.100846 UTC	1	0	2	2
2	g_hopper	2021-08-15 09:36:48.220464 UTC	0	2	0	2
3	a_lovelace	2021-08-15 13:59:03.441506 UTC	0	1	0	1

which is expected.
However, I can’t figure out how to simply return a table that looks like:

Row	username	update_time	max_product_ordered
1	a_turing	2021-08-14 20:03:22.100846 UTC	Pear
2	g_hopper	2021-08-15 09:36:48.220464 UTC	Orange
3	a_lovelace	2021-08-15 13:59:03.441506 UTC	Orange

I’m also fairly certain that although this query basically works (I end up doing post-processing in Python to get to the last step) it might be extremely inefficient given the extensive use of “common table expressions”.
Is there a more efficient way to query my BigQuery table than what I’ve written or would I need to completely restructure the table to get any speedup? It currently takes ~10s to run this query on a table with ~10,000 rows and 12 columns and I believe the slowness is due to the multiple CTEs.
I’ve been banging my head against the wall for the past two weeks trying to improve my query and haven’t made much headway. Any help is sincerely appreciated!

Answer

Consider below approach

with latest_orders as (
  select * from `mycompany.engagement.products_ordered`
  where true 
  qualify 1 = row_number() over(partition by id order by update_time desc)
), qualified_items as (
  select *, 
    array(
      select offset from t.ordered.list with offset 
      where true 
      qualify 1 = rank() over(order by item desc) 
    ) items
  from latest_orders t
)
select id, username, update_time,
  products.list[offset(items[offset(0)])] as max_product_ordered,
from qualified_items
where array_length(items) = 1

 
with latest_orders as (  select * from `mycompany.engagement.products_ordered`  where true   qualify 1 = row_number() over(partition by id order by update_time desc)), qualified_items as (  select *,     array(      select offset from t.ordered.list with offset       where true       qualify 1 = rank() over(order by item desc)     ) items  from latest_orders t)select id, username, update_time,  products.list[offset(items[offset(0)])] as max_product_ordered,from qualified_itemswhere array_length(items) = 1    ​

if applied to sample data in your question – output is

Advertisement

Answer