I am working with customer purchase data and am trying to write a query in Google BigQuery that sorts all purchases by date and adds a purchase/order count per customer (order_count). Also, I would like to calculate a time-delay (in days) between the orders of one customer (purchase_latency). My query currently looks like this:
select email, first_name, last_name, order_number, purchase_date, order_price, d.code, from my_order_data left join unnest(discount_codes) as d
The result including “order_count” and “purchase_latency” should look like the following:
email | order_number | purchase_date | order_price | order_count | purchase_latency a@a.com | 34874 | 2020-01-02 16:20:12 UTC | 20,- | 1 | 0 | a@a.com | 43598 | 2020-01-18 12:00:00 UTC | 30,- | 2 | 16 | a@a.com | 47520 | 2020-01-30 08:05:00 UTC | 15,- | 3 | 12 | b@b.com | 23598 | 2019-03-25 22:10:00 UTC | 22,- | 1 | 0 | b@b.com | 25459 | 2019-03-31 17:35:00 UTC | 55,- | 2 | 6 |
How would I add the numbering for “order_count” as well as the calculation for “purchase_latency”?
Thank you very much!
Advertisement
Answer
You can use window functions:
to enumerate the orders of each customer by increasing purchase date, you can use
row_number()
lag()
retrieves the date of the “previous” purchase, and you can compute the difference with the current date withdate_diff()
:
So:
select email, first_name, last_name, order_number, purchase_date, order_price, row_number() over(partition by email order by purchase_date) order_count, date_diff( date(purchase_date), coalesce(date(lag(purchase_date) over(partition by email order by purchase_date)), date(purchase_date)), day ) purchase_latency from my_order_data od left join unnest(od.discount_codes) as dc
Note: I would highly recommend prefixing all the column names in the query with the (alias of the) table they belong to; this makes the query unambiguous, and much easier to follow.