For a minimal, reproducible example (reprex) let’s assume I have a database object (dbo) in a Microsoft SQL Server and I want to query things in T-SQL.
My dbo looks like this:
Animal-ID Marker-ID Allele1 Allele2 -------------------------------------------- 1 OAR1 A G 1 OAR2 C C 1 OAR3 T G 2 OAR1 A C 2 OAR2 C C 2 OAR3 A C
What I would like to do is calculate an allele match percentage per Marker-ID across all Animal-IDs.
Given the dbo example from above the desired result looks like this:
Animal-ID-pair Marker-ID Match-percentage -------------------------------------------- 1-2 OAR1 50 1-2 OAR2 100 1-2 OAR3 0
So far, I tried the following approaches:
First I thought selecting individual rows is sufficient.
SELECT * FROM (SELECT ROW_NUMBER() OVER (ORDER BY Animal-ID ASC) AS rownumber, Animal-ID, Marker-ID, Allele1, Allele2 FROM dbo) AS foo WHERE rownumber BETWEEN 1 AND 3;
and then compare that to the range between 4 and 6.
The problem here is that, in my real and way lager data set, not all animal-ID pairs have the same number of rows, i.e. not the same number of markers.
That is why I thought grouping might be helpful:
SELECT Animal-ID, Marker-ID, Allele1, Allele2 FROM dbo WHERE Animal-ID IN (SELECT Animal-ID FROM dbo GROUP BY Animal-ID HAVING COUNT(*) > 1);
but that does not allow me to do comparisons and/or calculations across groups.
Thus I would like to ask how to calculate the degree of agreement in the comparison of row pairs.
Advertisement
Answer
Sample data
create table genomes ( AnimalId int, MarkerId nvarchar(10), Allele1 nvarchar(1), Allele2 nvarchar(2) ) insert into genomes (AnimalId, MarkerId, Allele1, Allele2) values (1, 'OAR1', 'A', 'G'), (1, 'OAR2', 'C', 'C'), (1, 'OAR3', 'T', 'G'), (2, 'OAR1', 'A', 'C'), (2, 'OAR2', 'C', 'C'), (2, 'OAR3', 'A', 'C'), (3, 'OAR1', 'A', 'G'), --new sample Animal with less data (no OAR3) (3, 'OAR2', 'C', 'G');
Solution
- Select all unique animals
cte_AllAnimals
. - Select all unique markers
cte_AllMarkers
. - Combine every animal with every animal behind it
a2.AnimalId > a1.AnimalId
. This will give you all unique animal combinations. - Combine every pair with every marker
cross join cte_AllMarkers
.
This gives me:
with cte_AllMarkers as ( select g.MarkerId from genomes g group by g.MarkerId ), cte_AllAnimals as ( select g.AnimalId from genomes g group by g.AnimalId ) select convert(nvarchar(10), a1.AnimalId) + '-' + convert(nvarchar(10), a2.AnimalId) as AnimalIdPair, m.MarkerId, case g1.Allele1 when g2.Allele1 then 50 else 0 end + case g1.Allele2 when g2.Allele2 then 50 else 0 end as MatchPercentage from cte_AllAnimals a1 join cte_AllAnimals a2 on a2.AnimalId > a1.AnimalId cross join cte_AllMarkers m left join genomes g1 on g1.AnimalId = a1.AnimalId and g1.MarkerId = m.MarkerId left join genomes g2 on g2.AnimalId = a2.AnimalId and g2.MarkerId = m.MarkerId order by a1.AnimalId, a2.AnimalId, m.MarkerId;
Result
AnimalIdPair MarkerId MatchPercentage ------------ -------- --------------- 1-2 OAR1 50 1-2 OAR2 100 1-2 OAR3 0 1-3 OAR1 100 1-3 OAR2 50 1-3 OAR3 0 2-3 OAR1 50 2-3 OAR2 50 2-3 OAR3 0
Fiddle to see it in action.