Imagine I have a table like
ID | Date |
---|---|
1 | 2021-01-01 |
1 | 2021-01-05 |
1 | 2021-01-17 |
1 | 2021-02-01 |
1 | 2021-02-18 |
1 | 2021-02-28 |
1 | 2021-03-30 |
2 | 2021-01-01 |
2 | 2021-01-14 |
2 | 2021-02-15 |
I want to select all data on this table, but creating a new column with a new Event_ID. An Event is defined as all the rows with the same ID, within a time frame of 15 days. The issue is that I want the time frame to move – as in the first 3 rows: row 2 is within the 15 days of row 1 (so they belong to the same event). Row 3 is within 15 days of row 2 (but further apart from row 1), but I want it to be added to the same event as before. (Note: the table is not ordered like in the example, it was just for convenience).
The output should be
ID | Date | Event_ID |
---|---|---|
1 | 2021-01-01 | 1 |
1 | 2021-01-05 | 1 |
1 | 2021-01-17 | 1 |
1 | 2021-02-01 | 1 |
1 | 2021-02-18 | 2 |
1 | 2021-02-28 | 2 |
1 | 2021-03-30 | 3 |
2 | 2021-01-01 | 4 |
2 | 2021-01-14 | 4 |
2 | 2021-02-15 | 5 |
I can also do it in R with data.table (depending on efficiency/performance)
Advertisement
Answer
Here is one data.table
approach in R :
library(data.table) #Change to data.table setDT(df) #Order the dataset setorder(df, ID, Date) #Set flag to TRUE/FALSE if difference is greater than 15 df[, greater_than_15 := c(TRUE, diff(Date) > 15), ID] #Take cumulative sum to create consecutive event id. df[, Event_ID := cumsum(greater_than_15)] df # ID Date greater_than_15 Event_ID # 1: 1 2021-01-01 TRUE 1 # 2: 1 2021-01-05 FALSE 1 # 3: 1 2021-01-17 FALSE 1 # 4: 1 2021-02-01 FALSE 1 # 5: 1 2021-02-18 TRUE 2 # 6: 1 2021-02-28 FALSE 2 # 7: 1 2021-03-30 TRUE 3 # 8: 2 2021-01-01 TRUE 4 # 9: 2 2021-01-14 FALSE 4 #10: 2 2021-02-15 TRUE 5
data
df <- structure(list(ID = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2), Date = structure(c(18628, 18632, 18644, 18659, 18676, 18686, 18716, 18628, 18641, 18673), class = "Date")), row.names = c(NA, -10L), class = "data.frame")