Skip to content
Advertisement

How to update rows and then delete all but one. Data deduplication

Given a table like this:

ID A B C D
01 3 2 1 0
01 5 2 1 0
01 0 2 1 0
00 4 8 1 1
00 4 8 1 1
00 4 8 1 1
03 6 4 0 0
03 0 2 0 0
03 6 4 0 0

How could I use SQL (T-SQL/SQL Server Mgt Studio), R, or Python to do the following in this order:

For each unique ID, update col A with the maximum value of A from all rows and then delete all rows except the one with the highest value in col B. I need to end up with only one row for a given ID.

Written another way; I have duplicate data and need to delete duplicate rows but only after ensuring I have the maximum value of col A from any duplicates applied to the row with the maximum value of col B. I want to only keep one row for a given ID but the row I want to keep is the one with the max col B. If there are multiple identical ID & col B rows, delete all but one.

Advertisement

Answer

In R:

library(dplyr)
your_data %>%
  group_by(ID) %>%
  mutate(A = max(A)) %>%
  slice(which.max(B)) %>%
  ungroup()

Or with data.table:

library(data.table)
setDT(your_data)
your_data[, A := max(A), by = ID][, .SD[which.max(B)], by = ID]
User contributions licensed under: CC BY-SA
5 People found this is helpful
Advertisement