I have a dataset PosNeg like this. I need to find count of ID’s who have a pattern like this- P N P P or N P N N P N – that is having at least one N (negative) between two P’s(positive). If this pattern occurs at least once, then count that ID. Date is always in ascending order.
Eg: for ID 1, I have at least 1 N on 02/25 between two P’s, Hence I’ll count ID 1. ID 2 and 3 does not have N between 2 P’s, so those are not counted. ID 4 also has one N between two P’s on 03/18, hence I’ll include 4. So the total number of ID’s which satisfices the condition is 2(1 and 4)
My idea is to find min(date) of positive and max(date) of positive for each ID, and look for any negative in between these dates, but not sure how to implement it. Any suggestions in R/Python/SQL would be helpful.
ID | Test | Date |
---|---|---|
1 | P | 2021-01-02 |
1 | P | 2021-01-08 |
1 | N | 2021-02-25 |
1 | P | 2021-03-26 |
2 | N | 2021-02-05 |
2 | P | 2021-03-04 |
2 | N | 2021-03-30 |
3 | N | 2021-01-24 |
3 | P | 2021-02-10 |
4 | N | 2021-02-15 |
4 | P | 2021-02-28 |
4 | N | 2021-03-18 |
4 | P | 2021-04-11 |
Output:
Total |
---|
2 |
EDIT1: There could be multiple N’s (at least 1) between two P’s, not just 1, and I want to include it in my count.
EDIT2: I want this ID to be included but it is not included in the result dataframe. But, there are multiple N’s in between 2 P’s.
ID | DATE | TEST |
---|---|---|
1 | 2020-06-12 | N |
1 | 2020-08-20 | N |
1 | 2020-10-04 | N |
1 | 2020-12-09 | N |
1 | 2021-01-08 | P |
1 | 2021-02-05 | P |
1 | 2021-03-26 | P |
1 | 2021-05-26 | P |
1 | 2021-06-30 | N |
1 | 2021-07-21 | N |
1 | 2021-08-23 | N |
1 | 2021-09-16 | N |
1 | 2021-10-08 | N |
1 | 2021-10-18 | N |
1 | 2021-10-29 | P |
EDIT3: The ID in the previous edit is 1 while in my the output of my real data starts from 15. I think it should start from 1. Also, its is not N and P in my real data but ‘Negative’ and ‘Positive’. My code is now this:
data4c %>% group_by(STUDY_ID) %>% summarise(isP = str_detect(str_c(TEST, collapse = ""), "PositiveNegative+Positive"), .groups = 'drop') %>% filter(isP)
Advertisement
Answer
Here is one option with str_c/str_detect
– grouped by ‘ID’, paste
the ‘Test’ elements and then check whether the pattern P
followed by one or more N
(N+
) and then a P
occurs
library(stringr) library(dplyr) df1 %>% group_by(ID) %>% summarise(isP = str_detect(str_c(substr(Test,1, 1) collapse = ""), "PN+P"), .groups = 'drop') %>% filter(isP) # A tibble: 2 × 2 ID isP <int> <lgl> 1 1 TRUE 2 4 TRUE
Using the OP’s new data
> df2 %>% group_by(ID) %>% summarise(isP = str_detect(str_c(substr(TEST,1, 1), collapse = ""), "PN+P"), .groups = 'drop') %>% filter(isP) # A tibble: 1 × 2 ID isP <int> <lgl> 1 1 TRUE
EDIT: added substr
to extract the first letter in ‘Test’ column as the original data values are not ‘P’ or ‘N’ as showed in example
data
df2 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), DATE = c("2020-06-12", "2020-08-20", "2020-10-04", "2020-12-09", "2021-01-08", "2021-02-05", "2021-03-26", "2021-05-26", "2021-06-30", "2021-07-21", "2021-08-23", "2021-09-16", "2021-10-08", "2021-10-18", "2021-10-29"), TEST = c("N", "N", "N", "N", "P", "P", "P", "P", "N", "N", "N", "N", "N", "N", "P")), class = "data.frame", row.names = c(NA, -15L))