Given a table like this: ID A B C D 01 3 2 1 0 01 5 2 1 0 01 0 2 1 0 00 4 8 1 1 00 4 8 1 1 00 4 8 1 1 03 6 4 0 0 03 0 2 0 0 03 6 4 0 0 How could I use SQL

How to update rows and then delete all but one. Da…

Given a table like this:

ID	A	B	C	D
01	3	2	1	0
01	5	2	1	0
01	0	2	1	0
00	4	8	1	1
00	4	8	1	1
00	4	8	1	1
03	6	4	0	0
03	0	2	0	0
03	6	4	0	0

How could I use SQL (T-SQL/SQL Server Mgt Studio), R, or Python to do the following in this order:

For each unique ID, update col A with the maximum value of A from all rows and then delete all rows except the one with the highest value in col B. I need to end up with only one row for a given ID.

Written another way; I have duplicate data and need to delete duplicate rows but only after ensuring I have the maximum value of col A from any duplicates applied to the row with the maximum value of col B. I want to only keep one row for a given ID but the row I want to keep is the one with the max col B. If there are multiple identical ID & col B rows, delete all but one.

Answer

In R:

library(dplyr)
your_data %>%
  group_by(ID) %>%
  mutate(A = max(A)) %>%
  slice(which.max(B)) %>%
  ungroup()

Or with data.table:

library(data.table)
setDT(your_data)
your_data[, A := max(A), by = ID][, .SD[which.max(B)], by = ID]

How to update rows and then delete all but one. Data deduplication

Advertisement

Answer