I have a data set that contain two columns [date, cust_id].
date cust_id 2019-12-08 123 2019-12-08 321 2019-12-09 123 2019-12-09 456
There is a high churn rate for my customers and I am trying to create two additional columns [new_cust, left_cust] by counting the numbers of cust_id that are new and have left by day respectively.
In the case I have two tables broken out by day, I have no issues by querying:
count of new customers
SELECT DISTINCT cust_id FROM 2019-12-09 WHERE cust_id NOT IN (SELECT DISTINCT cust_id FROM 2019-12-08)
count of customers who churned
SELECT DISTINCT cust_id FROM 2019-12-08 WHERE cust_id NOT IN (SELECT DISTINCT cust_id FROM 2019-12-09)
I’m not sure how I would query a single table and compare these values by date. What would be the best approach to getting the correct results? I am using AWS Athena.
Expected results:
date new_cust cust_left 2019-12-08 2 0 2019-12-09 1 1
Explanation: Assuming 2019-12-08 is the very first date, I have 2 new customers and 0 customers who have churned. 2019-12-09, I have gained 1 new customer “456”, but have 1 customer “321” who has churned. I would have to apply this to a longer range of dates and cust_id.
Advertisement
Answer
Hmmm. I think you want:
select date, sum(case when prev_date is null then 1 else 0 end) as new_cust, sum(case when next_date = date + interval '1' day then 0 else 1 end) as left_cust from (select t.*, lag(date) over (partition by cust_id order by date) as prev_date, lead(date) over (partition by cust_id order by date) as next_date from t ) t group by date;