Skip to content
Advertisement

Filtering out non-zero values unless the value is the only one within its group (SQL or R)

I have a data set of record numbers, with multiple values in a column for each record. I would like to produce a table where within each group it returns the minimum non-negative, non-NA value. But if that negative number or NA corresponds to the only row for that record, then I would like it to return NA.

data <- data.frame(record = c(1, 1, 1, 2, 3, 3, 4), value = c(-1, 2, 40, -10, 3, 4, NA))

| record | value |
------------------
|   1    |  -1   |
|   1    |   2   |
|   1    |   40  |
|   2    |  -10  |
|   3    |   3   |
|   3    |   4   |
|   4    |   NA  |
------------------

I would like to get

| record |  out  |
------------------
|   1    |   2   |
|   2    |  NA   |
|   3    |   3   |
|   4    |  NA   |
------------------

In R I have tried:

filter(data, value >= 0 | is.na(value))
data %>% group_by(record) %>% summarize(out = min(value))

In MySQL I have tried

select min(value) as out
from data
where (value >= 0 or value is null))

However, both of these solutions drop record #2 from the table.

How do I include a clause that detects the number of rows per record and doesn’t eliminate single records?

Advertisement

Answer

One option is to replace those ‘value’ less than 0 to NA, grouped by ‘record’, slice the rows where we get the min of ‘value’ or if all are NA, get the first row

data %>%
   mutate(value = replace(value, value < 0, NA)) %>% 
   group_by(record) %>% 
   slice(if(all(is.na(value))) 1 else which.min(value))
# A tibble: 4 x 2
# Groups:   record [4]
#  record value
#   <dbl> <dbl>
#1      1     2
#2      2    NA
#3      3     3
#4      4    NA
User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement