The Problem
While using a CTE built from the information_schema
, SELECT DISTINCT
within a subquery, and an ORDER BY
in the outer query, the output included duplicated results unexpectedly.
I’ve tried creating some small examples where this behaviour happens but I can’t get them to reproduce this behaviour, so I’ve used the specific case where I’m getting this problem to demonstrate it.
Example
I encountered this while putting together a table that lists all of the column_name
s across a selection of tables, with a column corresponding to each table and a true/false in the column indicating whether the column_name
exists in that table.
The following code sets up three tables that will be used to illustrate this problem.
CREATE TEMPORARY TABLE table_0(c1 VARCHAR, c2 VARCHAR); CREATE TEMPORARY TABLE table_1( c2 VARCHAR); CREATE TEMPORARY TABLE table_2(c1 VARCHAR);
The expected output is in the table below.
column_name | table_0 | table_1 | table_2 |
---|---|---|---|
c1 | true | false | true |
c2 | true | true | false |
Working Query
I can get this output with the following query:
WITH table_cols AS ( SELECT table_name, column_name FROM information_schema.columns WHERE table_name IN ( 'table_0', 'table_1', 'table_2' ) ) SELECT base.column_name, (table_0.column_name IS NOT NULL) AS table_0, (table_1.column_name IS NOT NULL) AS table_1, (table_2.column_name IS NOT NULL) AS table_2 FROM ( SELECT DISTINCT column_name FROM table_cols ) AS base LEFT JOIN ( SELECT * FROM table_cols WHERE table_name = 'table_0' ) AS table_0 USING(column_name) LEFT JOIN ( SELECT * FROM table_cols WHERE table_name = 'table_1' ) AS table_1 USING(column_name) LEFT JOIN ( SELECT * FROM table_cols WHERE table_name = 'table_2' ) AS table_2 USING(column_name) -- ORDER BY 1 ;
Faulty Query
However, when I uncomment the ORDER BY 1
at the bottom, the output is not deduplicated and produces the following:
column_name | table_0 | table_1 | table_2 |
---|---|---|---|
c1 | true | false | true |
c1 | true | false | true |
c2 | true | true | false |
In particular, the first and second row are duplicates.
This is particularly unexpected because base
is the list of deduplicated column_name
s, which is joined onto tables where the column_name
s are unique, so the duplicates are not coming from the joins.
This is also ‘fixed’ if I put a DISTINCT
keyword after the outer query’s SELECT
statement, but I originally omitted putting it there because I had expected the base
subquery to do the deduping before it does the ORDER BY
.
The Question
What causes this behaviour?
I can find some fixes for this, but I want to understand this behaviour so that I can better troubleshoot — and avoid — ‘problems’ like this again in the future.
Advertisement
Answer
Looks like you stumbled on a bug. Looks like the ‘faulty’ query applies the joins to the result of the WITH, not to the SELECT DISTINCT on it.
DISTINCT is rather prone to implementation errors because vendors in general consider it a ‘misfeature’ that should be avoided by users and therefore prefer spending the least time possible on getting it to work right in their product. Try it out and submit the bug report. It would surprise me if the first answer you’re getting would not be something like “it’s better to avoid DISTINCT”.
That’s relatively funny because the relational model in fact requires every SELECT to be of the DISTINCT kind …