I’m using PostgreSQL 9.4, and I have a table with 13 million rows and with data roughly as follows:
  a  | b | u  | t
-----+---+----+----
 foo | 1 |  1 | 10
 foo | 1 |  2 | 11
 foo | 1 |  2 | 11
 foo | 2 |  4 |  1
 foo | 3 |  5 |  2
 bar | 1 |  6 |  2
 bar | 2 |  7 |  2
 bar | 2 |  8 |  3
 bar | 3 |  9 |  4
 bar | 4 | 10 |  5
 bar | 5 | 11 |  6
 baz | 1 | 12 |  1
 baz | 1 | 13 |  2
 baz | 1 | 13 |  2
 baz | 1 | 13 |  3
There are indices on md5(a), on b, and on (md5(a), b). (In reality, a may contain values longer than 4k chars.) There is also a primary key column of type SERIAL which I have omitted above.
I’m trying to build a query which will return the following results:
  a  | b | u  | t  | z
-----+---+----+----+---
 foo | 1 |  1 | 10 | 3
 foo | 1 |  2 | 11 | 3
 foo | 2 |  4 |  1 | 3
 foo | 3 |  5 |  2 | 3
 bar | 1 |  6 |  2 | 5
 bar | 2 |  7 |  2 | 5
 bar | 2 |  8 |  3 | 5
 bar | 3 |  9 |  4 | 5
 bar | 4 | 10 |  5 | 5
 bar | 5 | 11 |  6 | 5
In these results, all rows are deduplicated as if GROUP BY a, b, u, t were applied, z is a count of distinct values of b for every partition over a, and only rows with a z value greater than 2 are included.
I can get just the z filter working as follows:
SELECT a, COUNT(b) AS z
FROM  (SELECT DISTINCT a, b FROM t) AS foo
GROUP  BY a
HAVING COUNT(b) > 2;
However, I’m stumped on combining this with the rest of the data in the table.
What’s the most efficient way to do this?
Answer
Your first step can be simpler already:
SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
FROM   t
GROUP  BY 1
HAVING count(DISTINCT b) > 2;
This works with md5(a) in place of a, since a can obviously be very long and you already have indexes on md5(a) and on (md5(a), b).
Since your table is big, you need an efficient query. The queries below should be among the fastest possible solutions, given adequate index support. Your index on (md5(a), b) is instrumental, but, assuming b, u, and t are small columns, an index on (md5(a), b, u, t) would be even better for the second step of the query (the join back to t).
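The wider index suggested above could be created roughly like this. This is a minimal sketch: the index name is hypothetical, and it assumes b, u, and t really are small columns, so that the index stays compact.

```sql
-- Hypothetical index name; assumes b, u and t are small columns.
-- md5(a) is indexed instead of a, because a itself can be very long.
CREATE INDEX t_md5_a_b_u_t_idx ON t (md5(a), b, u, t);

-- Refresh planner statistics so the new expression index is considered.
ANALYZE t;
```

Whether the planner actually uses it depends on your data distribution, so it is worth checking with EXPLAIN ANALYZE before dropping the existing (md5(a), b) index.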
Your desired end result:
SELECT DISTINCT ON (md5(t.a), t.b, t.u, t.t)
       t.a, t.b, t.u, t.t, a.z
FROM  (
   SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
   FROM   t
   GROUP  BY 1
   HAVING count(DISTINCT b) > 2
   ) a
JOIN   t ON md5(t.a) = a.md5_a
ORDER  BY md5(t.a), t.b, t.u, t.t;  -- optional; must start with the DISTINCT ON expressions
Or probably faster, yet:
SELECT a, b, u, t, z
FROM  (
   SELECT DISTINCT ON (1, 2, 3, 4)
          md5(t.a) AS md5_a, t.b, t.u, t.t, t.a
   FROM   t
   ) t
JOIN  (
   SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
   FROM   t
   GROUP  BY 1
   HAVING count(DISTINCT b) > 2
   ) z USING (md5_a)
ORDER  BY 1, 2, 3, 4;  -- optional
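To try either query on the sample data from the question, a throwaway setup along these lines should do. The column types and the id column name are assumptions, since the question does not state them; only the data and the three indexes come from the question.

```sql
-- Hypothetical test setup: column types and the "id" name are assumed.
CREATE TEMP TABLE t (
    id serial PRIMARY KEY,
    a  text,
    b  integer,
    u  integer,
    t  integer
);

-- Sample rows from the question, including the duplicates.
INSERT INTO t (a, b, u, t) VALUES
    ('foo', 1,  1, 10), ('foo', 1,  2, 11), ('foo', 1,  2, 11),
    ('foo', 2,  4,  1), ('foo', 3,  5,  2),
    ('bar', 1,  6,  2), ('bar', 2,  7,  2), ('bar', 2,  8,  3),
    ('bar', 3,  9,  4), ('bar', 4, 10,  5), ('bar', 5, 11,  6),
    ('baz', 1, 12,  1), ('baz', 1, 13,  2), ('baz', 1, 13,  2),
    ('baz', 1, 13,  3);

-- The three indexes described in the question.
CREATE INDEX ON t (md5(a));
CREATE INDEX ON t (b);
CREATE INDEX ON t (md5(a), b);
```

With this data, both queries should return the ten foo and bar rows of the desired result, possibly in a different order; baz drops out because it has only one distinct value of b.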
Detailed explanation for DISTINCT ON:
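As a quick illustration of the behavior (a minimal sketch on inline sample values, not a substitute for the PostgreSQL manual): DISTINCT ON keeps the first row of each group of rows that agree on the listed expressions, and ORDER BY decides which row counts as first.

```sql
-- One row is kept per (a, b); "u DESC" makes the row with the highest u win.
-- The inline VALUES list is illustrative only.
SELECT DISTINCT ON (a, b)
       a, b, u
FROM  (VALUES ('foo', 1, 1)
            , ('foo', 1, 2)
            , ('foo', 2, 4)) AS v(a, b, u)
ORDER  BY a, b, u DESC;
```

Here the two rows with (foo, 1) collapse into the one with u = 2. Without an ORDER BY, which row survives per group is unpredictable; that does not matter in the queries above, because there rows in a group are identical in every selected column.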