I hope this makes sense – it’s my first post here so I’m sorry if the question is badly formed.
I have tables OldData and NewData:
OldData
ID DateFrom DateTo Priority
1 2018-11-01 2018-12-01* 5
1 2018-12-01 2019-02-01 5
2 2017-06-01 2018-03-01 5
2 2018-03-01 2018-04-05* 5
NewData
ID DateFrom DateTo Priority
1 2018-11-13 2018-12-01 6
2 2018-03-21 2018-05-01 6
I need merge these tables as below. Where IDs match, dates overlap, and Priority is higher in NewData, I need to update the dates in OldData to reflect NewData.
ID DateFrom DateTo Priority
1 2018-11-01 2018-11-13 5
1 2018-11-13 2018-12-01 6
1 2018-12-01 2019-02-01 5
2 2017-06-01 2018-03-01 5
2 2018-03-01 2018-03-21 5
2 2018-03-21 2018-05-01 6
I first tried to run nested for loops through each table, matching criteria and making changes one at a time, but I’m sure there is a much better way. e.g. possibly using sql in r?
Advertisement
Answer
In general, I interpret this to be an rbind
operation with some cleanup: per-ID
, if there is any overlap in the date ranges, then the lower-priority date range is truncated to match. Though not shown in the data, if you have situations where two higher-priority rows may completely negate a middle row, then you might need to add to the logic (it might then turn into an iterative process).
tidyverse
library(dplyr)
out_tidyverse <- bind_rows(OldData, NewData) %>%
arrange(ID, DateFrom) %>%
group_by(ID) %>%
mutate(
DateTo = if_else(row_number() < n() &
DateTo > lead(DateFrom) & Priority < lead(Priority),
lead(DateFrom), DateTo),
DateFrom = if_else(row_number() > 1 &
DateFrom < lag(DateTo) & Priority < lag(Priority),
lag(DateTo), DateFrom)
) %>%
ungroup()
out_tidyverse
# # A tibble: 6 x 4
# ID DateFrom DateTo Priority
# <int> <chr> <chr> <int>
# 1 1 2018-11-01 2018-11-13 5
# 2 1 2018-11-13 2018-12-01 6
# 3 1 2018-12-01 2019-02-01 5
# 4 2 2017-06-01 2018-03-01 5
# 5 2 2018-03-01 2018-03-21 5
# 6 2 2018-03-21 2018-05-01 6
### confirm it is the same as your expected output
all(mapply(`==`, FinData, out_tidyverse))
# [1] TRUE
data.table
I am using magrittr
here in order to break out the flow in a readable fashion, but it is not required. If you’re comfortable with data.table
by itself, then translating from the magrittr::%>%
to a native data.table
piping should be straight-forward.
Also, I am using as.data.table
instead of the often-preferred side-effect setDT
, primarily so that you don’t use it on your production frame and not realize that many data.frame
operations in R (on those two frames) now behave somewhat differently. If you’re up for using data.table
, then feel free to step around this precaution.
library(data.table)
library(magrittr)
OldDT <- as.data.table(OldData)
NewDT <- as.data.table(NewData)
out_DT <- rbind(OldDT, NewDT) %>%
.[ order(ID, DateFrom), ] %>%
.[, .i := seq_len(.N), by = .(ID) ] %>%
.[, DateTo := fifelse(.i < .N &
DateTo > shift(DateFrom, type = "lead") &
Priority < shift(Priority, type = "lead"),
shift(DateFrom, type = "lead"), DateTo),
by = .(ID) ] %>%
.[, DateFrom := fifelse(.i > 1 &
DateFrom < shift(DateTo) &
Priority < shift(Priority),
shift(DateTo), DateFrom),
by = .(ID) ] %>%
.[, .i := NULL ]
out_DT[]
# ID DateFrom DateTo Priority
# 1: 1 2018-11-01 2018-11-13 5
# 2: 1 2018-11-13 2018-12-01 6
# 3: 1 2018-12-01 2019-02-01 5
# 4: 2 2017-06-01 2018-03-01 5
# 5: 2 2018-03-01 2018-03-21 5
# 6: 2 2018-03-21 2018-05-01 6
all(mapply(`==`, FinData, out_DT))
# [1] TRUE
Data:
OldData <- read.table(header = TRUE, text="
ID DateFrom DateTo Priority
1 2018-11-01 2018-12-01 5
1 2018-12-01 2019-02-01 5
2 2017-06-01 2018-03-01 5
2 2018-03-01 2018-04-05 5")
NewData <- read.table(header = TRUE, text="
ID DateFrom DateTo Priority
1 2018-11-13 2018-12-01 6
2 2018-03-21 2018-05-01 6")
FinData <- read.table(header = TRUE, text="
ID DateFrom DateTo Priority
1 2018-11-01 2018-11-13 5
1 2018-11-13 2018-12-01 6
1 2018-12-01 2019-02-01 5
2 2017-06-01 2018-03-01 5
2 2018-03-01 2018-03-21 5
2 2018-03-21 2018-05-01 6")