
Optimizing GROUP BY + COUNT DISTINCT on unnested jsonb column

I am trying to optimize a query in Postgres, without success.

Here is my table:

CREATE TABLE IF NOT EXISTS voc_cc348779bdc84f8aab483f662a798a6a (
  id SERIAL,
  date TIMESTAMP,
  text TEXT,
  themes JSONB,
  meta JSONB,
  canal VARCHAR(255),
  source VARCHAR(255),
  file VARCHAR(255)
);

I have indexes on the id and meta columns:

CREATE UNIQUE INDEX voc_cc348779bdc84f8aab483f662a798a6a_id ON voc_cc348779bdc84f8aab483f662a798a6a USING btree(id);
CREATE INDEX voc_cc348779bdc84f8aab483f662a798a6a_meta ON voc_cc348779bdc84f8aab483f662a798a6a USING btree(meta);

There are 62k rows in this table.

The query I’m trying to optimize is this one:

SELECT meta_split.key, meta_split.value, COUNT(DISTINCT(id))
    FROM voc_cc348779bdc84f8aab483f662a798a6a
    LEFT JOIN LATERAL jsonb_each(voc_cc348779bdc84f8aab483f662a798a6a.meta)
    AS meta_split ON TRUE
    WHERE meta_split.value IS NOT NULL
    GROUP BY meta_split.key, meta_split.value;

In this query, meta is a JSON object like this one:

{
  "Age": "50 to 59 yo",
  "Kids": "No kid",
  "Gender": "Male"
}

I want to get the full list of key/value pairs, plus the count of rows for each of them. Here is the output of EXPLAIN ANALYZE VERBOSE for my query:

GroupAggregate  (cost=1138526.13..1201099.13 rows=100 width=72) (actual time=2016.984..2753.058 rows=568 loops=1)
  Output: meta_split.key, meta_split.value, count(DISTINCT voc_cc348779bdc84f8aab483f662a798a6a.id)
  Group Key: meta_split.key, meta_split.value
  ->  Sort  (cost=1138526.13..1154169.13 rows=6257200 width=68) (actual time=2015.501..2471.027 rows=563148 loops=1)
        Output: meta_split.key, meta_split.value, voc_cc348779bdc84f8aab483f662a798a6a.id
        Sort Key: meta_split.key, meta_split.value
        Sort Method: external merge  Disk: 26672kB
        ->  Nested Loop  (cost=0.00..131538.72 rows=6257200 width=68) (actual time=0.029..435.456 rows=563148 loops=1)
              Output: meta_split.key, meta_split.value, voc_cc348779bdc84f8aab483f662a798a6a.id
              ->  Seq Scan on public.voc_cc348779bdc84f8aab483f662a798a6a  (cost=0.00..6394.72 rows=62572 width=294) (actual time=0.007..16.588 rows=62572 loops=1)
                    Output: voc_cc348779bdc84f8aab483f662a798a6a.id, voc_cc348779bdc84f8aab483f662a798a6a.date, voc_cc348779bdc84f8aab483f662a798a6a.text, voc_cc348779bdc84f8aab483f662a798a6a.themes, voc_cc348779bdc84f8aab483f662a798a6a.meta, voc_cc348779bdc84f8aab483f662a798a6a.canal, voc_cc348779bdc84f8aab483f662a798a6a.source, voc_cc348779bdc84f8aab483f662a798a6a.file
              ->  Function Scan on pg_catalog.jsonb_each meta_split  (cost=0.00..1.00 rows=100 width=64) (actual time=0.005..0.005 rows=9 loops=62572)
                    Output: meta_split.key, meta_split.value
                    Function Call: jsonb_each(voc_cc348779bdc84f8aab483f662a798a6a.meta)
                    Filter: (meta_split.value IS NOT NULL)
Planning Time: 1.502 ms
Execution Time: 2763.309 ms

I tried changing COUNT(DISTINCT(id)) to COUNT(DISTINCT voc_cc348779bdc84f8aab483f662a798a6a.*), and also using subqueries; those were respectively 10x and 30x slower. I also thought about maintaining a separate table with those counts, but I can’t do that, since I need to filter the results (for example, the query sometimes has a filter on the date column).

I don’t know how to optimize this further. The query is already quite slow with such a small row count, and I expect ten times as many rows later; if the execution time keeps scaling with the row count as it did over the first 62k rows, that will be far too slow.


Answer

This assumes id is not only UNIQUE – as enforced by your UNIQUE INDEX – but also NOT NULL. (That’s missing in your table definition.)

SELECT meta_split.key, meta_split.value, count(*)
FROM   voc_cc348779bdc84f8aab483f662a798a6a v
CROSS  JOIN LATERAL jsonb_each(v.meta) AS meta_split
GROUP  BY meta_split.key, meta_split.value;

Shorter equivalent:

SELECT meta_split.key, meta_split.value, count(*)
FROM   voc_cc348779bdc84f8aab483f662a798a6a v, jsonb_each(v.meta) AS meta_split
GROUP  BY 1, 2;

The LEFT [OUTER] JOIN was noise because the subsequent test WHERE meta_split.value IS NOT NULL forces an INNER JOIN anyway. So use CROSS JOIN instead.

Also, since jsonb does not allow duplicate keys on the same nesting level (meaning the same id can only pop up once per (key, value)), DISTINCT is just expensive noise. count(v.id) does the same, cheaper. And count(*) is equivalent and cheaper yet – assuming id is NOT NULL, as stated at the top.

count(*) has a separate implementation and is slightly faster than count(<value>). It is also subtly different from count(v.*): it counts all rows, no matter what, while the other forms skip NULL values.
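A minimal sketch to illustrate the difference (the one-column table t is hypothetical):

CREATE TEMP TABLE t (id int);
INSERT INTO t VALUES (1), (2), (NULL);

SELECT count(*)  AS all_rows      -- 3: counts every row, NULL or not
     , count(id) AS non_null_ids  -- 2: skips the row where id IS NULL
FROM   t;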

That is, as long as id cannot be NULL – as stated at the top. id should really be the PRIMARY KEY, which is implemented with a unique B-tree index internally anyway and marks its columns – just id here – NOT NULL implicitly. Failing that, the column should at least be declared NOT NULL. A UNIQUE INDEX does not fully qualify as a replacement: it still allows NULL values, which are not considered equal to each other and can appear multiple times.
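For example, the existing unique index can be promoted to a primary key constraint (a sketch; it assumes id currently contains no NULL values, and the constraint name is hypothetical):

ALTER TABLE voc_cc348779bdc84f8aab483f662a798a6a
  ADD CONSTRAINT voc_cc348779bdc84f8aab483f662a798a6a_pkey
  PRIMARY KEY USING INDEX voc_cc348779bdc84f8aab483f662a798a6a_id;
  -- reuses the existing unique index and marks id NOT NULL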

Apart from that, indexes are of no use here, as all rows have to be read anyway. So this is never going to be very cheap. But 62k rows is not a crippling row count by any means – unless you have huge numbers of keys in the jsonb column.

The remaining options to speed it up:

  1. Normalize your design. Unnesting JSON documents is not free of cost. (See the sketch after this list.)

  2. Maintain a materialized view. Feasibility and costs strongly depend on your write patterns. (Also sketched below.)
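For option 1, a minimal sketch (the table name voc_meta is hypothetical, and it assumes id is the PRIMARY KEY as recommended above):

CREATE TABLE voc_meta (
  voc_id integer REFERENCES voc_cc348779bdc84f8aab483f662a798a6a (id),
  key    text,
  value  text,
  PRIMARY KEY (voc_id, key)  -- safe: jsonb keys are unique per nesting level
);

INSERT INTO voc_meta (voc_id, key, value)
SELECT v.id, m.key, m.value
FROM   voc_cc348779bdc84f8aab483f662a798a6a v, jsonb_each_text(v.meta) AS m;

The aggregate then no longer has to unnest jsonb at query time:

SELECT key, value, count(*) FROM voc_meta GROUP BY 1, 2;

For option 2, a sketch along the same lines (the view name is hypothetical):

CREATE MATERIALIZED VIEW voc_meta_counts AS
SELECT meta_split.key, meta_split.value, count(*) AS ct
FROM   voc_cc348779bdc84f8aab483f662a798a6a v, jsonb_each(v.meta) AS meta_split
GROUP  BY 1, 2;

REFRESH MATERIALIZED VIEW voc_meta_counts;  -- after (batches of) writes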

… sometimes the query has a filter on the date column or the like.

That’s where indexes may play a role again …
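For instance, with a plain B-tree index on date, a filtered variant no longer has to read the whole table (the index name and date range are hypothetical):

CREATE INDEX voc_cc348779bdc84f8aab483f662a798a6a_date
  ON voc_cc348779bdc84f8aab483f662a798a6a (date);

SELECT meta_split.key, meta_split.value, count(*)
FROM   voc_cc348779bdc84f8aab483f662a798a6a v, jsonb_each(v.meta) AS meta_split
WHERE  v.date >= '2021-01-01'  -- a selective filter can use the index
GROUP  BY 1, 2;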
