Optimizing GROUP BY + COUNT DISTINCT on unnested jsonb column

I am trying to optimize a query in Postgres, without success.

Here is my table:
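A representative definition (the column types here are assumptions; only id, meta and a date column are referenced in what follows):

    CREATE TABLE voc_cc348779bdc84f8aab483f662a798a6a (
        id   uuid          -- type assumed; note: no PRIMARY KEY, no NOT NULL
      , date timestamptz   -- type assumed
      , meta jsonb
    );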

I have indexes on id and meta columns:
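Again as a sketch: the answer below confirms a UNIQUE index on id, while the index names and the GIN method for meta are assumptions:

    CREATE UNIQUE INDEX voc_id_uni ON voc_cc348779bdc84f8aab483f662a798a6a (id);
    CREATE INDEX voc_meta_gin ON voc_cc348779bdc84f8aab483f662a798a6a USING gin (meta);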

There are 62k rows in this table.

The query I’m trying to optimize is this one:
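Its shape can be reconstructed from the answer below, which quotes the alias meta_split, the LEFT JOIN, the IS NOT NULL filter and COUNT(DISTINCT(id)); treat this as a sketch rather than the literal original:

    SELECT meta_split.key, meta_split.value, COUNT(DISTINCT(id))
    FROM   voc_cc348779bdc84f8aab483f662a798a6a
    LEFT   JOIN LATERAL jsonb_each(meta) AS meta_split ON true
    WHERE  meta_split.value IS NOT NULL
    GROUP  BY meta_split.key, meta_split.value;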

In this query, meta is a dict like this one:
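For illustration, with invented keys and values:

    {
        "color": "blue",
        "size": "M",
        "brand": "acme"
    }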

I want to get the full list of key/value pairs, plus the row count for each of them. Here is the result of an EXPLAIN ANALYZE VERBOSE of my query:

I tried changing COUNT(DISTINCT(id)) to COUNT(DISTINCT voc_cc348779bdc84f8aab483f662a798a6a.*), and also using subqueries; these turned out 10x and 30x slower, respectively. I also thought about maintaining a separate table with those counts, but I can’t, since I need to filter the results (sometimes the query has a filter on the date column, for example).

I don’t really know how to optimize it further, but it’s quite slow for such a small row count. I expect to have ten times this number of rows later, and that would be too slow if the runtime keeps scaling with the row count the way it did for the first 62k.


Answer

Assuming id is not only UNIQUE (as enforced by your UNIQUE INDEX) but also NOT NULL. (The NOT NULL part is missing in your table definition.)

Shorter equivalent:
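A sketch of that shorter form, reconstructed to match the description that follows (CROSS JOIN instead of LEFT JOIN, count(*) instead of COUNT(DISTINCT ...)):

    SELECT meta_split.key, meta_split.value, count(*)
    FROM   voc_cc348779bdc84f8aab483f662a798a6a v
    CROSS  JOIN LATERAL jsonb_each(v.meta) AS meta_split
    GROUP  BY meta_split.key, meta_split.value;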

The LEFT [OUTER] JOIN was noise because the following test WHERE meta_split.value IS NOT NULL forces an INNER JOIN anyway. Using CROSS JOIN instead.

Also, since jsonb does not allow duplicate keys on the same nesting level (meaning the same id can only pop up once per (key, value) pair), DISTINCT is just expensive noise. count(v.id) does the same, cheaper. And count(*) is equivalent and cheaper yet, assuming id is NOT NULL as stated at the top.

count(*) has a separate implementation and is slightly faster than count(<value>). It is also subtly different from count(v.*): it counts all rows, no matter what, while the other form does not count NULL values.
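A self-contained illustration of the difference:

    SELECT count(*) AS all_rows     -- counts every row: 3
         , count(x) AS non_null_x   -- skips the NULL row: 2
    FROM  (VALUES (1), (NULL), (2)) AS t(x);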

That holds as long as id cannot be NULL, as stated at the top. id should really be the PRIMARY KEY, which is implemented with a unique B-tree index internally anyway, and which makes its columns (just id here) NOT NULL implicitly. Or at least make the column NOT NULL. A UNIQUE INDEX does not fully qualify as a replacement: it still allows NULL values, which are not considered equal to one another and can appear multiple times.
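To see the difference (demo is a made-up scratch table):

    CREATE TABLE demo (id int);
    CREATE UNIQUE INDEX demo_id_uni ON demo (id);
    INSERT INTO demo VALUES (NULL), (NULL);    -- succeeds: NULLs are not considered equal
    -- ALTER TABLE demo ADD PRIMARY KEY (id);  -- would fail: a PK implies NOT NULL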

Apart from that, indexes are of no use here, as all rows have to be read anyway. So this is never going to be very cheap. But 62k rows is not a crippling row count by any means – unless you have huge numbers of keys in the jsonb column.

The remaining options to speed it up:

  1. Normalize your design (sketched below). Unnesting JSON documents is not free of cost.

  2. Maintain a materialized view (also sketched below). Feasibility and costs strongly depend on your write patterns.
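A minimal sketch of option 1 (the names voc and voc_meta are hypothetical; types carried over from the assumed table definition above):

    CREATE TABLE voc (
        id   uuid PRIMARY KEY
      , date timestamptz
    );

    CREATE TABLE voc_meta (
        voc_id uuid NOT NULL REFERENCES voc (id)
      , key    text NOT NULL
      , value  text NOT NULL
      , PRIMARY KEY (voc_id, key)
    );

    -- the aggregate then needs no jsonb unnesting at all:
    SELECT key, value, count(*)
    FROM   voc_meta
    GROUP  BY key, value;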
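And a sketch of option 2 (the view name voc_meta_counts is made up; note that, like the separate count table dismissed in the question, a global view cannot absorb ad-hoc filters such as the date filter quoted below):

    CREATE MATERIALIZED VIEW voc_meta_counts AS
    SELECT meta_split.key, meta_split.value, count(*) AS row_count
    FROM   voc_cc348779bdc84f8aab483f662a798a6a v
    CROSS  JOIN LATERAL jsonb_each(v.meta) AS meta_split
    GROUP  BY meta_split.key, meta_split.value;

    -- refresh after (batches of) writes; the cost depends on your write patterns:
    REFRESH MATERIALIZED VIEW voc_meta_counts;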

… sometimes the query has a filter on the date column or the like.

That’s where indexes may play a role again …
