The Problem

While using a CTE built from the information_schema, SELECT DISTINCT within a subquery, and an ORDER BY in the outer query, the output included duplicated results unexpectedly.

I’ve tried creating some small examples where this behaviour happens but I can’t get them to reproduce this behaviour, so I’ve used the specific case where I’m getting this problem to demonstrate it.

Example

I encountered this while putting together a table that lists all of the column_names across a selection of tables, with a column corresponding to each table and a true/false in the column indicating whether the column_name exists in that table.

The following code sets up three tables that will be used to illustrate this problem.

CREATE TEMPORARY TABLE table_0(c1 VARCHAR, c2 VARCHAR);
CREATE TEMPORARY TABLE table_1(            c2 VARCHAR);
CREATE TEMPORARY TABLE table_2(c1 VARCHAR);

​x
 
CREATE TEMPORARY TABLE table_0(c1 VARCHAR, c2 VARCHAR);CREATE TEMPORARY TABLE table_1(            c2 VARCHAR);CREATE TEMPORARY TABLE table_2(c1 VARCHAR);​

The expected output is in the table below.

column_name	table_0	table_1	table_2
c1	true	false	true
c2	true	true	false

Working Query

I can get this output with the following query:

WITH
    table_cols AS (
        SELECT table_name, column_name
        FROM information_schema.columns
        WHERE table_name IN (
            'table_0',
            'table_1',
            'table_2'
        )
    )

SELECT
    base.column_name,
    (table_0.column_name IS NOT NULL) AS table_0,
    (table_1.column_name IS NOT NULL) AS table_1,
    (table_2.column_name IS NOT NULL) AS table_2
FROM (
    SELECT DISTINCT column_name FROM table_cols
) AS base
    LEFT JOIN (
        SELECT * FROM table_cols WHERE table_name = 'table_0'
    ) AS table_0 USING(column_name)
    LEFT JOIN (
        SELECT * FROM table_cols WHERE table_name = 'table_1'
    ) AS table_1 USING(column_name)
    LEFT JOIN (
        SELECT * FROM table_cols WHERE table_name = 'table_2'
    ) AS table_2 USING(column_name)
-- ORDER BY 1
;

 
WITH    table_cols AS (        SELECT table_name, column_name        FROM information_schema.columns        WHERE table_name IN (            'table_0',            'table_1',            'table_2'        )    )​SELECT    base.column_name,    (table_0.column_name IS NOT NULL) AS table_0,    (table_1.column_name IS NOT NULL) AS table_1,    (table_2.column_name IS NOT NULL) AS table_2FROM (    SELECT DISTINCT column_name FROM table_cols) AS base    LEFT JOIN (        SELECT * FROM table_cols WHERE table_name = 'table_0'    ) AS table_0 USING(column_name)    LEFT JOIN (        SELECT * FROM table_cols WHERE table_name = 'table_1'    ) AS table_1 USING(column_name)    LEFT JOIN (        SELECT * FROM table_cols WHERE table_name = 'table_2'    ) AS table_2 USING(column_name)-- ORDER BY 1;​

Faulty Query

However, when I uncomment the ORDER BY 1 at the bottom, the output is not deduplicated and produces the following:

column_name	table_0	table_1	table_2
c1	true	false	true
c1	true	false	true
c2	true	true	false

In particular, the first and second row are duplicates.

This is particularly unexpected because base is the list of deduplicated column_names, which is joined onto tables where the column_names are unique, so the duplicates are not coming from the joins.

This is also ‘fixed’ if I put a DISTINCT keyword after the outer query’s SELECT statement, but I originally omitted putting it there because I had expected the base subquery to do the deduping before it does the ORDER BY.

The Question

What causes this behaviour?

I can find some fixes for this, but I want to understand this behaviour so that I can better troubleshoot — and avoid — ‘problems’ like this again in the future.

Answer

Looks like you stumbled on a bug. Looks like the ‘faulty’ query applies the joins to the result of the WITH, not to the SELECT DISTINCT on it.

DISTINCT is rather prone to implementation errors because vendors in general consider it a ‘misfeature’ that should be avoided by users and therefore prefer spending the least time possible on getting it to work right in their product. Try it out and submit the bug report. It would surprise me if the first answer you’re getting would not be something like “it’s better to avoid DISTINCT”.

That’s relatively funny because the relational model in fact requires every SELECT to be of the DISTINCT kind …

SELECT DISTINCT in Redshift subquery not deduping when using ORDER BY in outer query