I just started studying SQL and this is a demo given by the teacher in an online course and it works fine. The statement is looking for “students such that number of other students with same GPA is equal to number of other students with same sizeHS”:
select * from Student S1 where ( select count(*) from Student S2 where S2.sID <> S1.sID and S2.GPA = S1.GPA ) = ( select count(*) from Student S2 where S2.sID <> S1.sID and S2.sizeHS = S1.sizeHS );
It seems that in this where
clause, we’re comparing two relations (because the result of a subquery is a relation), but most of the time we are comparing attributes(as far as I’ve seen).
So I’m thinking about whether there are requirements for how many attributes, and how many tuples, the RELATION should contain when comparing two RELATIONS. If not, how do we compare two RELATIONS when there’re multiple attributes or multiple tuples and what do we get for result? Note:
Student relation has 4 attributes: sID, sName, GPA, sizeHS. And here’s the data:
+-----+--------+-----+--------+ | sID | sName | GPA | sizeHS | +-----+--------+-----+--------+ | 123 | Amy | 3.9 | 1000 | | 234 | Bob | 3.6 | 1500 | | 345 | Craig | 3.5 | 500 | | 456 | Doris | 3.9 | 1000 | | 567 | Edward | 2.9 | 2000 | | 678 | Fay | 3.8 | 200 | | 789 | Gary | 3.4 | 800 | | 987 | Helen | 3.7 | 800 | | 876 | Irene | 3.9 | 400 | | 765 | Jay | 2.9 | 1500 | | 654 | Amy | 3.9 | 1000 | | 543 | Craig | 3.4 | 2000 | +-----+--------+-----+--------+
and the result of this query is:
+-----+--------+-----+---------+ | sID | sName | GPA | sizeHS | +-----+--------+-----+---------+ | 345 | Craig | 3.5 | 500 | | 567 | Edward | 2.9 | 2000 | | 678 | Fay | 3.8 | 200 | | 789 | Gary | 3.4 | 800 | | 765 | Jay | 2.9 | 1500 | | 543 | Craig | 3.4 | 2000 | +-----+--------+-----+---------+
Advertisement
Answer
because the result of a subquery is a relation
Relation is the scientific name for what we call a table in a database and I like the name “table” much better than “relation”. A table is easy to imagine. We know them from our school time schedule for instance. Yes, we relate things here inside a table (day and time and the subject taught in school), but we can also relate tables to tables (pupils’ timetables with the table of class rooms, the overall subject schedule, and the teacher’s timetables). As such, tables in an RDBMS are also related to each other (hence the name relational database management system). I find the name relation for a table quite confusing (and many people use the word “relation” to describe the relations between tables instead).
So, yes, a query result itself is again a table (“relation”). And from tables we can of course select:
select * from (select * from b) as subq;
And then there are scalar queries that return exactly one row and one column. select count(*) from b
is such a query. While this is still a table we can select from
select * from (select count(*) as cnt from b) as subq;
we can even use them where we usually have single values, e.g. in the select clause:
select a.*, (select count(*) from b) as cnt from a;
In your query you have two scalar subqueries in your where clause.
With subqueries there is another distinction to make: we have correlated and non-correlated subqueries. The last query I have just shown contains a non-correlated subquery. It selects the count of b rows for every single result row, no matter what that row contains elsewise. A correlated subquery on the other hand may look like this:
select a.*, (select count(*) from b where b.x = a.y) as cnt from a;
Here, the subquery is related to the main table. For every result row we look up the count of b rows matching the a row we are displaying via where b.x = a.y
, so the count is different from row to row (but we’d get the same count for a rows sharing the same y value).
Your subqueries are also correlated. As with the select clause, the where clause deals with one row at a time (in order to keep or dismiss it). So we look at one student S1 at a time. For this student we count other students (S2, where S2.sID <> S1.sID
) who have the same GPA (and S2.GPA = S1.GPA
) and count other students who have the same sizeHS. We only keep students (S1) where there are exactly as many other students with the same GPA as there are with the same sizeHS.
UPDATE
As do dealing with multiple tuples as in
select * from Student S1 where ( select count(*), avg(grade) from Student S2 where S2.sID <> S1.sID and S2.GPA = S1.GPA ) = ( select count(*), avg(grade) from Student S2 where S2.sID <> S1.sID and S2.sizeHS = S1.sizeHS );
this is possible in some DBMS, but not in SQL Server. SQL Server doesn’t know tuples.
But there are other means to achieve the same. You could just add two subqueries:
select * from student s1 where (...) = (...) -- compare counts here and (...) = (...) -- compare averages here
Or get the data in the FROM
clause and then deal with it. E.g.:
select * from Student S1 cross apply ( select count(*) as cnt, avg(grade) as avg_grade from Student S2 where S2.sID <> S1.sID and S2.GPA = S1.GPA ) sx cross apply ( select count(*) as cnt, avg(grade) as avg_grade from Student S2 where S2.sID <> S1.sID and S2.sizeHS = S1.sizeHS ) sy where sx.cnt = sy.cnt and sx.avg_grade = sy.avg_grade;