I have a table generated from a CRM software and it has a lot of repited individuals in different rows but with different fields completed in each repetition, something like this:
id | birth_date | sex | postal_code | customer | smoker |
---|---|---|---|---|---|
001 | NULL | NULL | 00067 | Yes | 1 |
001 | NULL | Male | 00067 | NULL | 1 |
001 | 21/03/1994 | NULL | 00067 | NULL | NULL |
002 | NULL | Female | NULL | NULL | NULL |
002 | NULL | NULL | 09986 | No | 0 |
003 | 13/01/1986 | NULL | NULL | No | 1 |
The desired table extracted from this one should be this:
id | birth_date | sex | postal_code | customer | smoker |
---|---|---|---|---|---|
001 | 21/03/1994 | Male | 00067 | Yes | 1 |
002 | NULL | Female | 09986 | No | 0 |
003 | 13/01/1986 | NULL | NULL | No | 1 |
So as you can see, there are some columns completed in some repetitions, so the idea is to avoid repetitions (here we see a repetition because there’s the same id for more than one row, the id column should be the primary key) and get the maximum information about that id.
Any idea of how to achieve that?
Advertisement
Answer
You can aggregate the rows and take the max non-null value, as in:
select id, max(birth_date) as birth_date, max(sex) as sex, max(postal_code) as postal_code, max(customer) as customer, max(smoker) as smoker from t group by id
Now, this solution won’t account for inconsistent data when, for example, the same person has two different birth dates.