I came across this question and was trying to understand how the answer given there is actually transforming the data.
Input Table
+---------+-------+---------------+ | user_id | State | Subscriptions | +---------+-------+---------------+ | 1 | LA | 4 | | 2 | LA | 4 | | 3 | LA | 12 | | 4 | LA | 12 | | 5 | LA | 8 | | 6 | LA | 3 | | 7 | NY | 14 | | 8 | NY | 15 | | 9 | NY | 3 | | 10 | NY | 2 | | 11 | NY | 4 | | 12 | NY | 12 | | 13 | OH | 6 | | 14 | OH | 8 | | 15 | OH | 2 | | 16 | OH | 3 | +---------+-------+---------------+
Output Table
+--------------------+----+----+----+ | Subscription_Range | LA | NY | OH | +--------------------+----+----+----+ | 1 to 4 | 3 | 3 | 2 | | 5 to 11 | 1 | 0 | 2 | | 12 to 15 | 2 | 3 | 0 | +--------------------+----+----+----+
Answer given by Gordon Linoff:
select (case when subscriptions <= 4 then '1 to 4' when subscriptions <= 11 then '5 to 11' when subscriptions <= 15 then '12 to 15' end) as subscription_range, sum(case when state = 'LA' then 1 else 0 end) as LA, sum(case when state = 'NY' then 1 else 0 end) as NY, sum(case when state = 'OH' then 1 else 0 end) as OH from t group by (case when subscriptions <= 4 then '1 to 4' when subscriptions <= 11 then '5 to 11' when subscriptions <= 15 then '12 to 15' end) order by min(subscriptions);
I wanted to understand fundamentally how this query executes.
For example:
- When the first row is selected will the
Subscriptions
column be checked first? (Since it is the first that is checked with acase
in the query.) - After checking it turns out that it should be assigned
1 to 4
. Now what next? - Will the state column be checked? It turns out that it is
LA
but I am not getting how the further execution is happening. I am trying to imagine the table forming before aggregation.
Does SQL operate in a row-wise fashion? As in, each and every row is picked up from the database and the corresponding part of the query is applied to each column? (Like case
is applied to the Subscriptions
column in this case.)
Advertisement
Answer
When evaluating the expressions within the select
clause prior to aggregation, you can imagine that the following table would be obtained for the given dataset:
+--------------------+-----+----+----+ | subscription_range | LA | NY | OH | +--------------------+-----+----+----+ | '1 to 4' | 1 | 0 | 0 | | '1 to 4' | 1 | 0 | 0 | | '12 to 15' | 1 | 0 | 0 | | '12 to 15' | 1 | 0 | 0 | | '5 to 11' | 1 | 0 | 0 | | '1 to 4' | 1 | 0 | 0 | | '12 to 15' | 0 | 1 | 0 | | '12 to 15' | 0 | 1 | 0 | | '1 to 4' | 0 | 1 | 0 | | '1 to 4' | 0 | 1 | 0 | | '1 to 4' | 0 | 1 | 0 | | '12 to 15' | 0 | 1 | 0 | | '5 to 11' | 0 | 0 | 1 | | '5 to 11' | 0 | 0 | 1 | | '1 to 4' | 0 | 0 | 1 | | '1 to 4' | 0 | 0 | 1 | +--------------------+-----+----+----+
Here, for each row in the dataset, the first case
statement yields a string, and the subsequent case
statements yield either a 1
or 0
depending on whether the state
column validates the test expression.
When aggregated, the query calculates the same set of subscription_range
values and groups by
this data such that each subscription_range
is distinct.
The remaining numerical data in the subsequent columns is then summed over each group by the sum
expressions enclosing each case
statement, yielding:
+--------------------+----+----+----+ | subscription_range | LA | NY | OH | +--------------------+----+----+----+ | 1 to 4 | 3 | 3 | 2 | | 5 to 11 | 1 | 0 | 2 | | 12 to 15 | 2 | 3 | 0 | +--------------------+----+----+----+