Select values from one column which share a value in another column

Here is the problem. The database describes symmetrical and transferable relations between objects (genes from different species). If a gene X from species 1 is in relation to gene Y from species 2, and gene Y from species 2 is in relation to gene Z in species 3, then gene X from species 1 is in relation to gene Z in species 3.

Here is an example table:

species1 gene1 species2 gene2
2        Y     1        X
2        Y     3        Z

Now, here is what I want to do. Given species 1 and 3 (both in column species2), find all genes in column gene2 for which there is a common value in column gene1.

Basically, this is the output I would like to have:

X  Z

… for each pair that satisfies this condition.

Note that each gene id is uniquely present in only one species.

Explanation: column species1 contains a handful of “hub” species (like humans). So if I want to find a rat gene matching a mouse gene Card9, there are two possibilities: (i) there is no match in human genome, and in this case there will be a rat match in column gene1, or the mouse gene will be in gene1 and the rat gene in column gene2; (ii) there is a match in human (or some other) genome, in which case I need first to find the match in the human genome in column gene1, and then find the match to that gene in the rat genome in column gene2.

Clearly, I imagine that I would be able to do it outside of SQL:

Select all genes in column gene2 from species 1
Select all genes in column gene2 from species 3
For each gene from species 1, find matching genes in column gene1 for which column species2 is 3.

However, I am sure that there is a clever SQL statement that could do that.

The database has about 5 millions of rows.

Answer

You can do it with a self join:

select t1.gene2 result1, t2.gene2 result2 
from tablename t1 inner join tablename t2
on t2.gene1 = t1.gene1
where t1.species2 = 1 and t2.species2 = 3

See the demo.
Another way to do it with aggregation:

select min(gene2) result1, max(gene2) result2
from tablename
where species2 in (1, 3)
group by gene1
having result1 <> result2;

See the demo.
Results:

| result1 | result2 |
| ------- | ------- |
| X       | Z       |

Advertisement

Answer