Skip to content
Advertisement

How to calculate the degree of agreement by row comparisons in SQL Server?

For a minimal, reproducible example (reprex) let’s assume I have a database object (dbo) in a Microsoft SQL Server and I want to query things in T-SQL.

My dbo looks like this:

Animal-ID   Marker-ID   Allele1     Allele2
--------------------------------------------
1           OAR1        A           G
1           OAR2        C           C
1           OAR3        T           G
2           OAR1        A           C
2           OAR2        C           C
2           OAR3        A           C

What I would like to do is calculate an allele match percentage per Marker-ID across all Animal-IDs.

Given the dbo example from above the desired result looks like this:

Animal-ID-pair  Marker-ID   Match-percentage
--------------------------------------------
1-2             OAR1        50
1-2             OAR2        100
1-2             OAR3        0

So far, I tried the following approaches:

First I thought selecting individual rows is sufficient.

SELECT *
FROM 
    (SELECT
         ROW_NUMBER() OVER (ORDER BY Animal-ID ASC) AS rownumber,
         Animal-ID, Marker-ID,
         Allele1, Allele2
     FROM 
         dbo) AS foo
WHERE 
    rownumber BETWEEN 1 AND 3;

and then compare that to the range between 4 and 6.

The problem here is that, in my real and way lager data set, not all animal-ID pairs have the same number of rows, i.e. not the same number of markers.

That is why I thought grouping might be helpful:

SELECT 
    Animal-ID, Marker-ID,
    Allele1, Allele2
FROM 
    dbo
WHERE 
    Animal-ID IN (SELECT Animal-ID FROM dbo
                  GROUP BY Animal-ID 
                  HAVING COUNT(*) > 1);

but that does not allow me to do comparisons and/or calculations across groups.

Thus I would like to ask how to calculate the degree of agreement in the comparison of row pairs.

Advertisement

Answer

Sample data

create table genomes
(
  AnimalId int,
  MarkerId nvarchar(10),
  Allele1 nvarchar(1),
  Allele2 nvarchar(2)
)

insert into genomes (AnimalId, MarkerId, Allele1, Allele2) values
(1, 'OAR1', 'A', 'G'),
(1, 'OAR2', 'C', 'C'),
(1, 'OAR3', 'T', 'G'),
(2, 'OAR1', 'A', 'C'),
(2, 'OAR2', 'C', 'C'),
(2, 'OAR3', 'A', 'C'),
(3, 'OAR1', 'A', 'G'), --new sample Animal with less data (no OAR3)
(3, 'OAR2', 'C', 'G');

Solution

  1. Select all unique animals cte_AllAnimals.
  2. Select all unique markers cte_AllMarkers.
  3. Combine every animal with every animal behind it a2.AnimalId > a1.AnimalId. This will give you all unique animal combinations.
  4. Combine every pair with every marker cross join cte_AllMarkers.

This gives me:

with cte_AllMarkers as
(
  select g.MarkerId
  from genomes g
  group by g.MarkerId
),
cte_AllAnimals as
(
  select g.AnimalId
  from genomes g
  group by g.AnimalId
)
select convert(nvarchar(10), a1.AnimalId) + '-' +
       convert(nvarchar(10), a2.AnimalId) as AnimalIdPair,
       m.MarkerId,
       case g1.Allele1 when g2.Allele1 then 50 else 0 end +
       case g1.Allele2 when g2.Allele2 then 50 else 0 end as MatchPercentage
from cte_AllAnimals a1
join cte_AllAnimals a2
  on a2.AnimalId > a1.AnimalId
cross join cte_AllMarkers m
left join genomes g1
  on  g1.AnimalId = a1.AnimalId
  and g1.MarkerId = m.MarkerId
left join genomes g2
  on  g2.AnimalId = a2.AnimalId
  and g2.MarkerId = m.MarkerId
order by a1.AnimalId,
         a2.AnimalId,
         m.MarkerId;

Result

AnimalIdPair MarkerId MatchPercentage
------------ -------- ---------------
1-2          OAR1     50
1-2          OAR2     100
1-2          OAR3     0
1-3          OAR1     100
1-3          OAR2     50
1-3          OAR3     0
2-3          OAR1     50
2-3          OAR2     50
2-3          OAR3     0

Fiddle to see it in action.

User contributions licensed under: CC BY-SA
7 People found this is helpful
Advertisement