I am working with the R Programming Language. I have the following tables (note: all variables appear as “Factors”):
table_1 = data.frame(id = c("123", "123", "125", "C125-B"), date_1 = c("2010-01-31","2010-01-31", "2016-01-31", "2018-01-31" )) table_1$id = as.factor(table_1$id) table_1$date_1 = as.factor(table_1$date_1) table_2 = data.frame(id = c("5123", "123 A", "125", "125"), date_2 = c("2009-01-31","2010-01-31", "2010-01-31", "2010-01-31" ), date_3 = c("2011-01-31","2010-01-31", "2020-01-31", "2020-01-31" )) table_2$id = as.factor(table_2$id) table_2$date_2 = as.factor(table_2$date_2) table_2$date_3 = as.factor(table_2$date_3) > table_1 id date_1 1 123 2010-01-31 2 123 2010-01-31 3 125 2016-01-31 4 C125-B 2018-01-31 table_2 id date_2 date_3 1 5123 2009-01-31 2011-01-31 2 123 A 2010-01-31 2010-01-31 3 125 2010-01-31 2020-01-31 4 125 2010-01-31 2020-01-31
I am trying to “join” (e.g. inner join) this tables on the following conditions:
1) if table_1$id "fuzzy equal" table_2$id
AND
2) if table_1$date BETWEEN(table_2$date_2,table_2$date_3)
I tried to write the following code in R to do this:
library(fuzzyjoin) stringdist_inner_join(table_1, table_2, by ="id", distance_col = NULL)
Question: But I am not sure if the stringdist_inner_join
function can accommodate this kind of “between” logic.
Can someone please show me how to do this? Are there any other methods to accomplish this in R?
Thanks!
Advertisement
Answer
How about this? We could do the stringdist_inner_join and filter afterwards if the dates are stored as dates. This should be plenty performant for most data, and if not you should probably use data.table instead of fuzzyjoin.
library(fuzzyjoin) library(dplyr) table_1$date_1 = as.Date(table_1$date_1) table_2$date_2 = as.Date(table_2$date_2) table_2$date_3 = as.Date(table_2$date_3) stringdist_inner_join(table_1, table_2, by = "id", max_dist = 2) %>% filter(date_1 >= date_2, date_1 <= date_3) id.x date_1 id.y date_2 date_3 1 123 2010-01-31 5123 2009-01-31 2011-01-31 2 123 2010-01-31 123 A 2010-01-31 2010-01-31 3 123 2010-01-31 125 2010-01-31 2020-01-31 4 123 2010-01-31 125 2010-01-31 2020-01-31 5 123 2010-01-31 5123 2009-01-31 2011-01-31 6 123 2010-01-31 123 A 2010-01-31 2010-01-31 7 123 2010-01-31 125 2010-01-31 2020-01-31 8 123 2010-01-31 125 2010-01-31 2020-01-31 9 125 2016-01-31 125 2010-01-31 2020-01-31 10 125 2016-01-31 125 2010-01-31 2020-01-31