PostgreSQL distinct rows joined with a count of distinct values in one column

Question

I'm using PostgreSQL 9.4, and I have a table with 13 million rows and with data roughly as follows: There are indices on md5(a), on b, and on (md5(a), b). (In reality, a may contain values longer than 4k chars.) There is also a primary key column of type SERIAL which I have omitted above. I'm tryi…

Accepted Answer

Your first step can already be simpler:

    SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
    FROM   t
    GROUP  BY 1
    HAVING count(DISTINCT b) > 2;

This works with md5(a) in place of a, since a can obviously be very long and you already have an index on md5(a) etc.

Since your table is big, you need an efficient query. This should be among the fastest possible solutions, given adequate index support. Your index on (md5(a), b) is instrumental, but (assuming b, u, and t are small columns) an index on (md5(a), b, u, t) would be even better for the second step of the query (the join back to t).

Your desired end result:

    SELECT DISTINCT ON (md5(t.a), b, u, t)
           t.a, t.b, t.u, t.t, a.z
    FROM  (
       SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
       FROM   t
       GROUP  BY 1
       HAVING count(DISTINCT b) > 2
       ) a
    JOIN   t ON md5(t.a) = md5_a
    ORDER  BY md5(t.a), t.b, t.u, t.t;  -- optional; must lead with the DISTINCT ON expressions

Or probably faster yet:

    SELECT a, b, u, t, z
    FROM  (
       SELECT DISTINCT ON (1, 2, 3, 4)
              md5(t.a) AS md5_a, t.b, t.u, t.t, t.a
       FROM   t
       ) t
    JOIN  (
       SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
       FROM   t
       GROUP  BY 1
       HAVING count(DISTINCT b) > 2
       ) z USING (md5_a)
    ORDER  BY 1, 2, 3, 4;  -- optional

Detailed explanation for DISTINCT ON: "Select first row in each GROUP BY group?"
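To try the first step and the wider index on a small scale, here is a minimal sketch. The column types, the sample rows, and the index name are all assumptions made up for illustration; the question does not show the real table definition.

```sql
-- Hypothetical stand-in for the real table; column types are assumed.
CREATE TEMP TABLE t (
   id serial PRIMARY KEY
 , a  text
 , b  int
 , u  int
 , t  int
);

-- Made-up sample rows.
INSERT INTO t (a, b, u, t) VALUES
   ('foo', 1, 1, 1)
 , ('foo', 2, 1, 1)
 , ('foo', 3, 1, 1)
 , ('foo', 3, 1, 1)   -- exact duplicate of the previous row
 , ('bar', 1, 1, 1)
 , ('bar', 2, 1, 1);  -- 'bar' has only 2 distinct values of b

-- The wider expression index suggested in the answer (index name is arbitrary).
CREATE INDEX t_md5_a_b_u_t_idx ON t (md5(a), b, u, t);

-- Step 1 on its own: groups of md5(a) with more than 2 distinct values of b.
SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
FROM   t
GROUP  BY 1
HAVING count(DISTINCT b) > 2;
```

With these rows, only the 'foo' group qualifies (z = 3), while 'bar' is dropped by the HAVING clause. On a table this small the planner will most likely ignore the index; checking the plan with EXPLAIN on the real 13-million-row table is the better test of whether it helps.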


Answer