I hope this makes sense – it’s my first post here so I’m sorry if the question is badly formed.
I have tables OldData and NewData:
OldData ID DateFrom DateTo Priority 1 2018-11-01 2018-12-01* 5 1 2018-12-01 2019-02-01 5 2 2017-06-01 2018-03-01 5 2 2018-03-01 2018-04-05* 5 NewData ID DateFrom DateTo Priority 1 2018-11-13 2018-12-01 6 2 2018-03-21 2018-05-01 6
I need merge these tables as below. Where IDs match, dates overlap, and Priority is higher in NewData, I need to update the dates in OldData to reflect NewData.
ID DateFrom DateTo Priority 1 2018-11-01 2018-11-13 5 1 2018-11-13 2018-12-01 6 1 2018-12-01 2019-02-01 5 2 2017-06-01 2018-03-01 5 2 2018-03-01 2018-03-21 5 2 2018-03-21 2018-05-01 6
I first tried to run nested for loops through each table, matching criteria and making changes one at a time, but I’m sure there is a much better way. e.g. possibly using sql in r?
Advertisement
Answer
In general, I interpret this to be an rbind
operation with some cleanup: per-ID
, if there is any overlap in the date ranges, then the lower-priority date range is truncated to match. Though not shown in the data, if you have situations where two higher-priority rows may completely negate a middle row, then you might need to add to the logic (it might then turn into an iterative process).
tidyverse
library(dplyr) out_tidyverse <- bind_rows(OldData, NewData) %>% arrange(ID, DateFrom) %>% group_by(ID) %>% mutate( DateTo = if_else(row_number() < n() & DateTo > lead(DateFrom) & Priority < lead(Priority), lead(DateFrom), DateTo), DateFrom = if_else(row_number() > 1 & DateFrom < lag(DateTo) & Priority < lag(Priority), lag(DateTo), DateFrom) ) %>% ungroup() out_tidyverse # # A tibble: 6 x 4 # ID DateFrom DateTo Priority # <int> <chr> <chr> <int> # 1 1 2018-11-01 2018-11-13 5 # 2 1 2018-11-13 2018-12-01 6 # 3 1 2018-12-01 2019-02-01 5 # 4 2 2017-06-01 2018-03-01 5 # 5 2 2018-03-01 2018-03-21 5 # 6 2 2018-03-21 2018-05-01 6 ### confirm it is the same as your expected output all(mapply(`==`, FinData, out_tidyverse)) # [1] TRUE
data.table
I am using magrittr
here in order to break out the flow in a readable fashion, but it is not required. If you’re comfortable with data.table
by itself, then translating from the magrittr::%>%
to a native data.table
piping should be straight-forward.
Also, I am using as.data.table
instead of the often-preferred side-effect setDT
, primarily so that you don’t use it on your production frame and not realize that many data.frame
operations in R (on those two frames) now behave somewhat differently. If you’re up for using data.table
, then feel free to step around this precaution.
library(data.table) library(magrittr) OldDT <- as.data.table(OldData) NewDT <- as.data.table(NewData) out_DT <- rbind(OldDT, NewDT) %>% .[ order(ID, DateFrom), ] %>% .[, .i := seq_len(.N), by = .(ID) ] %>% .[, DateTo := fifelse(.i < .N & DateTo > shift(DateFrom, type = "lead") & Priority < shift(Priority, type = "lead"), shift(DateFrom, type = "lead"), DateTo), by = .(ID) ] %>% .[, DateFrom := fifelse(.i > 1 & DateFrom < shift(DateTo) & Priority < shift(Priority), shift(DateTo), DateFrom), by = .(ID) ] %>% .[, .i := NULL ] out_DT[] # ID DateFrom DateTo Priority # 1: 1 2018-11-01 2018-11-13 5 # 2: 1 2018-11-13 2018-12-01 6 # 3: 1 2018-12-01 2019-02-01 5 # 4: 2 2017-06-01 2018-03-01 5 # 5: 2 2018-03-01 2018-03-21 5 # 6: 2 2018-03-21 2018-05-01 6 all(mapply(`==`, FinData, out_DT)) # [1] TRUE
Data:
OldData <- read.table(header = TRUE, text=" ID DateFrom DateTo Priority 1 2018-11-01 2018-12-01 5 1 2018-12-01 2019-02-01 5 2 2017-06-01 2018-03-01 5 2 2018-03-01 2018-04-05 5") NewData <- read.table(header = TRUE, text=" ID DateFrom DateTo Priority 1 2018-11-13 2018-12-01 6 2 2018-03-21 2018-05-01 6") FinData <- read.table(header = TRUE, text=" ID DateFrom DateTo Priority 1 2018-11-01 2018-11-13 5 1 2018-11-13 2018-12-01 6 1 2018-12-01 2019-02-01 5 2 2017-06-01 2018-03-01 5 2 2018-03-01 2018-03-21 5 2 2018-03-21 2018-05-01 6")