Skip to content
Advertisement

Amazon Redshift – lateral column alias reference

Based on

Amazon Redshift announces support for lateral column alias reference:

The support for lateral column alias reference enables you to write queries without repeating the same expressions in the SELECT list. For example, you can define the alias ‘probability’ and use it within the same select statement:

select clicks / impressions as probability, 
        round(100 * probability, 1) as percentage from raw_data;

Which is basically the same as:

select 1 AS col
      ,col + 1 AS col2;

db<>fiddle demo

Most SQL RDBMSes will return an error: Unknown column 'col' in 'field list'


It looks like as interesting language extension but there is a caveat. What if I have an undeterministic function:

select RAND() AS col
      ,col + 1 AS col2

-- if RAND() returns 0.5 then I would expect
-- 0.5 and 1.5

-- I get: 0.3 and 1.7
-- it means that the query was evaluated as:
select RAND() AS col,
       RAND() + 1 AS col2

Comparing with LATERAL JOIN from PostgreSQL(yes, I am aware this is different feature, I would expect “lateral coulmn alias” to behave the same way):

SELECT s.col, s.col+1 AS col2
FROM t ,LATERAL (SELECT RANDOM()) AS s(col)  
-- 0.19089933477628307  1.190899334776283

db<>fiddle demo

But it is not a case. I am getting two independent runs which seems to be valid if it is simple “inlining”:

SELECT List

The alias is recognized right after it is defined in the target list. You can use an alias in other expressions defined after it in the same target list. The following example illustrates this.

The benefit of the lateral alias reference is you don’t need to repeat the aliased expression when building more complex expressions in the same target list. When Amazon Redshift parses this type of reference, it just inlines the previously defined aliases. If there is a column with the same name defined in the FROM clause as the previously aliased expression, the column in the FROM clause takes priority.

Is my understanding correct and this functionality is not “safe” when we are using undeterministic or time-sensitive function/references/subqueries?

Advertisement

Answer

This syntax is not safe. In fact, merely inlining the code means that it does not even provide a performance advantage. It is only syntactic sugar.

Given that there are easy alternatives — CTEs and subqueries — I would just avoid this new “feature”.

If there were a setting to turn this off, I would recommend using it.

Incidentally, many newcomers to SQL find this quite disconcerting. This purpose is to avoid ambiguity. What should the following query return?

select (a + 1) as b, b 
from (select 1 as a, 0 as b) x;

The designers of SQL probably felt that the rules around resolving such situations are more complex than merely rewriting a subquery.

The one “database” that I know of that resolves this well is actually SAS proc SQL. It introduced the calculated keyword, so you can write:

select (a + 1) as b, calculated b, b
from (select 1 as a, 0 as b) x;

And this would return 2, 2, 0.

In other words, I don’t think Amazon put much thought into the implementation of this “feature”.

User contributions licensed under: CC BY-SA
6 People found this is helpful
Advertisement