Skip to content
Advertisement

R – get a vector that tells me if a value of another vector is the first appearence or not

I have a data frame of sales with three columns: the code of the customer, the month the customer bought that item, and the year.

A customer can buy something in september and then in december make another purchase, so appear two times. But I’m interested in knowing the absolutely new customoers by month and year.

So I have thought in make an iteration and some checks and use the %in% function and build a boolean vector that tells me if a customer is new or not and then count by month and year with SQL using this new vector.

But I’m wondering if there’s a specific function or a better way to do that.

This is an example of the data I would like to have:

     date   cust  month new_customer
1   14975      25   1    TRUE
2   14976      30   1    TRUE
3   14977      22   1    TRUE
4   14978       4   1    TRUE
5   14979      25   1    FALSE
6   14980      11   1    TRUE
7   14981      17   1    TRUE
8   14982      17   1    FALSE
9   14983      18   1    TRUE
10  14984       7   1    TRUE
11  14985      24   1    TRUE
12  14986      22   1    FALSE

So put it more simple: the data frame is sorted by date, and I’m interested in a vector (new_customer) that tells me if the customer purchased something for the first time or not. For example customer 25 bought something the first day, and then four days later bought something again, so is not a new customer. The same can be seen with customer 17 and 22.

Advertisement

Answer

I create dummy data my self with id, month of numeric format, and year

dat <-data.frame(
  id = c(1,2,3,4,5,6,7,8,1,3,4,5,1,2,2),
  month = c(1,6,7,8,2,3,4,8,11,1,10,9,1,12,2),
  year = c(2019,2019,2019,2019,2019,2020,2020,2020,2020,2020,2021,2021,2021,2021,2021)
)

   id month year
1   1     1 2019
2   2     6 2019
3   3     7 2019
4   4     8 2019
5   5     2 2019
6   6     3 2020
7   7     4 2020
8   8     8 2020
9   1    11 2020
10  3     1 2020
11  4    10 2021
12  5     9 2021
13  1     1 2021
14  2    12 2021
15  2     2 2021

Then, group by id and arrange by year and month (order is meaningful). Then use filter and row_number().

dat %>%
  group_by(id) %>%
  arrange(year, month) %>%
  filter(row_number() == 1)

     id month  year
  <dbl> <dbl> <dbl>
1     1     1  2019
2     5     2  2019
3     2     6  2019
4     3     7  2019
5     4     8  2019
6     6     3  2020
7     7     4  2020
8     8     8  2020
User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement