I have a data frame of sales with three columns: the code of the customer, the month the customer bought that item, and the year.
A customer can buy something in september and then in december make another purchase, so appear two times. But I’m interested in knowing the absolutely new customoers by month and year.
So I have thought in make an iteration and some checks and use the %in% function and build a boolean vector that tells me if a customer is new or not and then count by month and year with SQL using this new vector.
But I’m wondering if there’s a specific function or a better way to do that.
This is an example of the data I would like to have:
date cust month new_customer 1 14975 25 1 TRUE 2 14976 30 1 TRUE 3 14977 22 1 TRUE 4 14978 4 1 TRUE 5 14979 25 1 FALSE 6 14980 11 1 TRUE 7 14981 17 1 TRUE 8 14982 17 1 FALSE 9 14983 18 1 TRUE 10 14984 7 1 TRUE 11 14985 24 1 TRUE 12 14986 22 1 FALSE
So put it more simple: the data frame is sorted by date, and I’m interested in a vector (new_customer) that tells me if the customer purchased something for the first time or not. For example customer 25 bought something the first day, and then four days later bought something again, so is not a new customer. The same can be seen with customer 17 and 22.
Advertisement
Answer
I create dummy data my self with id
, month
of numeric format, and year
dat <-data.frame( id = c(1,2,3,4,5,6,7,8,1,3,4,5,1,2,2), month = c(1,6,7,8,2,3,4,8,11,1,10,9,1,12,2), year = c(2019,2019,2019,2019,2019,2020,2020,2020,2020,2020,2021,2021,2021,2021,2021) ) id month year 1 1 1 2019 2 2 6 2019 3 3 7 2019 4 4 8 2019 5 5 2 2019 6 6 3 2020 7 7 4 2020 8 8 8 2020 9 1 11 2020 10 3 1 2020 11 4 10 2021 12 5 9 2021 13 1 1 2021 14 2 12 2021 15 2 2 2021
Then, group by id
and arrange
by year
and month
(order is meaningful). Then use filter
and row_number()
.
dat %>% group_by(id) %>% arrange(year, month) %>% filter(row_number() == 1) id month year <dbl> <dbl> <dbl> 1 1 1 2019 2 5 2 2019 3 2 6 2019 4 3 7 2019 5 4 8 2019 6 6 3 2020 7 7 4 2020 8 8 8 2020