Skip to content
Advertisement

Three-Way Diff in SQL

I have three SQL tables (A, B, and C), representing three different version of a dataset. I want to devise an SQL query whose effect is to extract the ids of rows/tuples whose values are different in all three tables. (Two records are different if there exists a field where the records do not share the same value.) For simplicity, let’s just assume that A, B, and C each have N records with record ids ranging from 1 to N, so for every id from 1 to N, there is a record in each table with that ID.

What might be the most efficient way to do this in SQL? One way would be to do something like

(SELECT id FROM 
((SELECT * FROM A EXCEPT SELECT * FROM B) EXCEPT (SELECT * FROM C)) result)
EXCEPT
(SELECT id FROM
((SELECT * FROM A) INTERSECT (SELECT * FROM B)) result2)

Basically what I’ve done above is first found the ids of records where the version in A differs from the version of B and from the version in C (in the first two lines of the SQL query I’ve written). What’s left is to filter out the ids of record where the version in B matches the version in C (which is done in the last two lines). But this seems horribly inefficient; is there a better, more concise way?

Note: I’m using PostgreSQL syntax here.

Advertisement

Answer

I would do it like this:

select id, 
       a.id is null as "missing in a",
       b.id is null as "missing in b",
       c.id is null as "missing in c",
       a is distinct from b as "a and b different",
       a is distinct from c as "a and c different",
       b is distinct from c as "b and c different"
from a 
  full join b using (id)
  full join c using (id) 
where a is distinct from b
   or b is distinct from c
   or a is distinct from c   

The id column is assumed to be a primary (or unique) key.

Online example

User contributions licensed under: CC BY-SA
10 People found this is helpful
Advertisement