Skip to content
Advertisement

Create new Event_ID based on ID with sliding window on date column

Imagine I have a table like

ID Date
1 2021-01-01
1 2021-01-05
1 2021-01-17
1 2021-02-01
1 2021-02-18
1 2021-02-28
1 2021-03-30
2 2021-01-01
2 2021-01-14
2 2021-02-15

I want to select all data on this table, but creating a new column with a new Event_ID. An Event is defined as all the rows with the same ID, within a time frame of 15 days. The issue is that I want the time frame to move – as in the first 3 rows: row 2 is within the 15 days of row 1 (so they belong to the same event). Row 3 is within 15 days of row 2 (but further apart from row 1), but I want it to be added to the same event as before. (Note: the table is not ordered like in the example, it was just for convenience).

The output should be

ID Date Event_ID
1 2021-01-01 1
1 2021-01-05 1
1 2021-01-17 1
1 2021-02-01 1
1 2021-02-18 2
1 2021-02-28 2
1 2021-03-30 3
2 2021-01-01 4
2 2021-01-14 4
2 2021-02-15 5

I can also do it in R with data.table (depending on efficiency/performance)

Advertisement

Answer

Here is one data.table approach in R :

library(data.table)
#Change to data.table
setDT(df)
#Order the dataset
setorder(df, ID, Date)
#Set flag to TRUE/FALSE if difference is greater than 15
df[, greater_than_15 := c(TRUE, diff(Date) > 15), ID]
#Take cumulative sum to create consecutive event id.
df[, Event_ID := cumsum(greater_than_15)]
df

#    ID       Date greater_than_15 Event_ID
# 1:  1 2021-01-01            TRUE        1
# 2:  1 2021-01-05           FALSE        1
# 3:  1 2021-01-17           FALSE        1
# 4:  1 2021-02-01           FALSE        1
# 5:  1 2021-02-18            TRUE        2
# 6:  1 2021-02-28           FALSE        2
# 7:  1 2021-03-30            TRUE        3
# 8:  2 2021-01-01            TRUE        4
# 9:  2 2021-01-14           FALSE        4
#10:  2 2021-02-15            TRUE        5

data

df <- structure(list(ID = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2), 
Date = structure(c(18628, 18632, 18644, 18659, 18676, 18686, 18716, 
18628, 18641, 18673), class = "Date")), 
row.names = c(NA, -10L), class = "data.frame")
User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement