Given a table like this:
ID | A | B | C | D |
---|---|---|---|---|
01 | 3 | 2 | 1 | 0 |
01 | 5 | 2 | 1 | 0 |
01 | 0 | 2 | 1 | 0 |
00 | 4 | 8 | 1 | 1 |
00 | 4 | 8 | 1 | 1 |
00 | 4 | 8 | 1 | 1 |
03 | 6 | 4 | 0 | 0 |
03 | 0 | 2 | 0 | 0 |
03 | 6 | 4 | 0 | 0 |
How could I use SQL (T-SQL/SQL Server Mgt Studio), R, or Python to do the following in this order:
For each unique ID, update col A with the maximum value of A from all rows and then delete all rows except the one with the highest value in col B. I need to end up with only one row for a given ID.
Written another way; I have duplicate data and need to delete duplicate rows but only after ensuring I have the maximum value of col A from any duplicates applied to the row with the maximum value of col B. I want to only keep one row for a given ID but the row I want to keep is the one with the max col B. If there are multiple identical ID & col B rows, delete all but one.
Advertisement
Answer
In R:
library(dplyr) your_data %>% group_by(ID) %>% mutate(A = max(A)) %>% slice(which.max(B)) %>% ungroup()
Or with data.table
:
library(data.table) setDT(your_data) your_data[, A := max(A), by = ID][, .SD[which.max(B)], by = ID]