Skip to content
Advertisement

Merge tables in R and update rows where dates overlap

I hope this makes sense – it’s my first post here so I’m sorry if the question is badly formed.

I have tables OldData and NewData:

OldData

ID DateFrom   DateTo      Priority
1  2018-11-01 2018-12-01* 5
1  2018-12-01 2019-02-01  5
2  2017-06-01 2018-03-01  5
2  2018-03-01 2018-04-05* 5

NewData

ID DateFrom   DateTo      Priority
1  2018-11-13 2018-12-01  6
2  2018-03-21 2018-05-01  6

I need merge these tables as below. Where IDs match, dates overlap, and Priority is higher in NewData, I need to update the dates in OldData to reflect NewData.

ID DateFrom   DateTo     Priority
1  2018-11-01 2018-11-13 5
1  2018-11-13 2018-12-01 6
1  2018-12-01 2019-02-01 5
2  2017-06-01 2018-03-01 5
2  2018-03-01 2018-03-21 5
2  2018-03-21 2018-05-01 6

I first tried to run nested for loops through each table, matching criteria and making changes one at a time, but I’m sure there is a much better way. e.g. possibly using sql in r?

Advertisement

Answer

In general, I interpret this to be an rbind operation with some cleanup: per-ID, if there is any overlap in the date ranges, then the lower-priority date range is truncated to match. Though not shown in the data, if you have situations where two higher-priority rows may completely negate a middle row, then you might need to add to the logic (it might then turn into an iterative process).

tidyverse

library(dplyr)
out_tidyverse <- bind_rows(OldData, NewData) %>%
  arrange(ID, DateFrom) %>%
  group_by(ID) %>%
  mutate(
    DateTo = if_else(row_number() < n() &
                       DateTo > lead(DateFrom) & Priority < lead(Priority),
                     lead(DateFrom), DateTo),
    DateFrom = if_else(row_number() > 1 &
                         DateFrom < lag(DateTo) & Priority < lag(Priority),
                       lag(DateTo), DateFrom)
  ) %>%
  ungroup()
out_tidyverse
# # A tibble: 6 x 4
#      ID DateFrom   DateTo     Priority
#   <int> <chr>      <chr>         <int>
# 1     1 2018-11-01 2018-11-13        5
# 2     1 2018-11-13 2018-12-01        6
# 3     1 2018-12-01 2019-02-01        5
# 4     2 2017-06-01 2018-03-01        5
# 5     2 2018-03-01 2018-03-21        5
# 6     2 2018-03-21 2018-05-01        6

### confirm it is the same as your expected output
all(mapply(`==`, FinData, out_tidyverse))
# [1] TRUE

data.table

I am using magrittr here in order to break out the flow in a readable fashion, but it is not required. If you’re comfortable with data.table by itself, then translating from the magrittr::%>% to a native data.table piping should be straight-forward.

Also, I am using as.data.table instead of the often-preferred side-effect setDT, primarily so that you don’t use it on your production frame and not realize that many data.frame operations in R (on those two frames) now behave somewhat differently. If you’re up for using data.table, then feel free to step around this precaution.

library(data.table)
library(magrittr)
OldDT <- as.data.table(OldData)
NewDT <- as.data.table(NewData)
out_DT <- rbind(OldDT, NewDT) %>%
  .[ order(ID, DateFrom), ] %>%
  .[, .i := seq_len(.N), by = .(ID) ] %>%
  .[, DateTo := fifelse(.i < .N &
                          DateTo > shift(DateFrom, type = "lead") &
                          Priority < shift(Priority, type = "lead"),
                        shift(DateFrom, type = "lead"), DateTo),
    by = .(ID) ] %>%
  .[, DateFrom := fifelse(.i > 1 &
                            DateFrom < shift(DateTo) &
                            Priority < shift(Priority),
                          shift(DateTo), DateFrom),
    by = .(ID) ] %>%
  .[, .i := NULL ]

out_DT[]
#    ID   DateFrom     DateTo Priority
# 1:  1 2018-11-01 2018-11-13        5
# 2:  1 2018-11-13 2018-12-01        6
# 3:  1 2018-12-01 2019-02-01        5
# 4:  2 2017-06-01 2018-03-01        5
# 5:  2 2018-03-01 2018-03-21        5
# 6:  2 2018-03-21 2018-05-01        6

all(mapply(`==`, FinData, out_DT))
# [1] TRUE

Data:

OldData <- read.table(header = TRUE, text="
ID DateFrom   DateTo     Priority
1  2018-11-01 2018-12-01 5
1  2018-12-01 2019-02-01 5
2  2017-06-01 2018-03-01 5
2  2018-03-01 2018-04-05 5")
NewData <- read.table(header = TRUE, text="
ID DateFrom   DateTo     Priority
1  2018-11-13 2018-12-01 6
2  2018-03-21 2018-05-01 6")

FinData <- read.table(header = TRUE, text="
ID DateFrom   DateTo     Priority
1  2018-11-01 2018-11-13 5
1  2018-11-13 2018-12-01 6
1  2018-12-01 2019-02-01 5
2  2017-06-01 2018-03-01 5
2  2018-03-01 2018-03-21 5
2  2018-03-21 2018-05-01 6")
User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement