I have purchase data that I’m trying to analyze. I’m attempting to find the 2nd and 3rd purchases (and potentially 4, 5, etc.) for each user.
The data I have currently looks like this:
email, first_purchase_date, processed_date, sku, first_order abc@email.com 6/1/2019 6/1/2019 HG1555 HG1555 abc@email.com 6/1/2019 8/1/2019 RF2655 HG1555 abc@email.com 6/1/2019 8/1/2019 FP7789 HG1555 abc@email.com 6/1/2019 11/1/2019 RF2655 HG1555 def@email.com 7/1/2019 7/1/2019 RF2655 RF2655 def@email.com 7/1/2019 8/1/2019 HG1555 RF2655 def@email.com 7/1/2019 8/1/2019 FP7789 RF2655 xyz@email.com 9/1/2019 6/1/2019 HG1555 HG1555 xyz@email.com 9/1/2019 11/1/2019 FP7789 HG1555
The first order column is looking at what was in the first basket that the customer purchased. Is there a way I can create another column that will tell me exactly what purchase number each line item is? Ideally, the table will look like this:
email, first_purchase_date, processed_date, sku, first_order, order number abc@email.com 6/1/2019 6/1/2019 HG1555 HG1555 1 abc@email.com 6/1/2019 8/1/2019 RF2655 HG1555 2 abc@email.com 6/1/2019 8/1/2019 FP7789 HG1555 2 abc@email.com 6/1/2019 11/1/2019 RF2655 HG1555 3 def@email.com 7/1/2019 7/1/2019 RF2655 RF2655 1 def@email.com 7/1/2019 8/1/2019 HG1555 RF2655 2 def@email.com 7/1/2019 8/1/2019 FP7789 RF2655 2 xyz@email.com 9/1/2019 6/1/2019 HG1555 HG1555 1 xyz@email.com 9/1/2019 11/1/2019 FP7789 HG1555 2
Basically, I want to look at each email and determine where the processed_date falls in respect to all other processed_dates that are registered to that user. The problem I’m trying to work around is when there are multiple line items for a particular processed_date.
Advertisement
Answer
Below is for BigQuery Standard SQL and assumes the processed_date column is of date type
#standardSQL SELECT *, DENSE_RANK() OVER(PARTITION BY email ORDER BY processed_date) order_number FROM `project.dataset.table` -- ORDER BY email, processed_date
If processed_date is a string – you can use below then
#standardSQL SELECT *, DENSE_RANK() OVER(PARTITION BY email ORDER BY PARSE_DATE('%d/%m/%Y', processed_date)) order_number FROM `project.dataset.table` -- ORDER BY email, PARSE_DATE('%d/%m/%Y', processed_date)