
Optimizing GROUP BY + COUNT DISTINCT on unnested jsonb column

I am trying to optimize a query in Postgres, without success.

Here is my table:

CREATE TABLE IF NOT EXISTS voc_cc348779bdc84f8aab483f662a798a6a (
  id SERIAL,
  date TIMESTAMP,
  text TEXT,
  themes JSONB,
  meta JSONB,
  canal VARCHAR(255),
  source VARCHAR(255),
  file VARCHAR(255)
);

I have indexes on the id and meta columns:

CREATE UNIQUE INDEX voc_cc348779bdc84f8aab483f662a798a6a_id ON voc_cc348779bdc84f8aab483f662a798a6a USING btree(id);
CREATE INDEX voc_cc348779bdc84f8aab483f662a798a6a_meta ON voc_cc348779bdc84f8aab483f662a798a6a USING btree(meta);

There are 62k rows in this table.

The query I’m trying to optimize is this one:

SELECT meta_split.key, meta_split.value, COUNT(DISTINCT(id))
    FROM voc_cc348779bdc84f8aab483f662a798a6a
    LEFT JOIN LATERAL jsonb_each(voc_cc348779bdc84f8aab483f662a798a6a.meta)
    AS meta_split ON TRUE
    WHERE meta_split.value IS NOT NULL
    GROUP BY meta_split.key, meta_split.value;

In this query, meta is a JSON object like this one:

{
  "Age": "50 to 59 yo",
  "Kids": "No kid",
  "Gender": "Male"
}

I want to get the full list of key/value pairs, plus the count of rows for each of them. Here is the output of EXPLAIN ANALYZE VERBOSE for my query:

GroupAggregate  (cost=1138526.13..1201099.13 rows=100 width=72) (actual time=2016.984..2753.058 rows=568 loops=1)
  Output: meta_split.key, meta_split.value, count(DISTINCT voc_cc348779bdc84f8aab483f662a798a6a.id)
  Group Key: meta_split.key, meta_split.value
  ->  Sort  (cost=1138526.13..1154169.13 rows=6257200 width=68) (actual time=2015.501..2471.027 rows=563148 loops=1)
        Output: meta_split.key, meta_split.value, voc_cc348779bdc84f8aab483f662a798a6a.id
        Sort Key: meta_split.key, meta_split.value
        Sort Method: external merge  Disk: 26672kB
        ->  Nested Loop  (cost=0.00..131538.72 rows=6257200 width=68) (actual time=0.029..435.456 rows=563148 loops=1)
              Output: meta_split.key, meta_split.value, voc_cc348779bdc84f8aab483f662a798a6a.id
              ->  Seq Scan on public.voc_cc348779bdc84f8aab483f662a798a6a  (cost=0.00..6394.72 rows=62572 width=294) (actual time=0.007..16.588 rows=62572 loops=1)
                    Output: voc_cc348779bdc84f8aab483f662a798a6a.id, voc_cc348779bdc84f8aab483f662a798a6a.date, voc_cc348779bdc84f8aab483f662a798a6a.text, voc_cc348779bdc84f8aab483f662a798a6a.themes, voc_cc348779bdc84f8aab483f662a798a6a.meta, voc_cc348779bdc84f8aab483f662a798a6a.canal, voc_cc348779bdc84f8aab483f662a798a6a.source, voc_cc348779bdc84f8aab483f662a798a6a.file
              ->  Function Scan on pg_catalog.jsonb_each meta_split  (cost=0.00..1.00 rows=100 width=64) (actual time=0.005..0.005 rows=9 loops=62572)
                    Output: meta_split.key, meta_split.value
                    Function Call: jsonb_each(voc_cc348779bdc84f8aab483f662a798a6a.meta)
                    Filter: (meta_split.value IS NOT NULL)
Planning Time: 1.502 ms
Execution Time: 2763.309 ms

I tried changing COUNT(DISTINCT(id)) to COUNT(DISTINCT voc_cc348779bdc84f8aab483f662a798a6a.*), and also using subqueries; those were respectively 10x and 30x slower. I also thought about maintaining a separate table with those counts, but I can’t do that, since I need to filter the results (for example, the query sometimes has a filter on the date column).

I don’t know how to optimize this further. The query is already quite slow with such a small row count, and I expect ten times as many rows later; if the execution time keeps scaling with the row count as it did over the first 62k rows, that will be far too slow.


Answer

This assumes id is not only UNIQUE – as enforced by your UNIQUE INDEX – but also NOT NULL. (That’s missing in your table definition.)

SELECT meta_split.key, meta_split.value, count(*)
FROM   voc_cc348779bdc84f8aab483f662a798a6a v
CROSS  JOIN LATERAL jsonb_each(v.meta) AS meta_split
GROUP  BY meta_split.key, meta_split.value;

Shorter equivalent:

SELECT meta_split.key, meta_split.value, count(*)
FROM   voc_cc348779bdc84f8aab483f662a798a6a v, jsonb_each(v.meta) AS meta_split
GROUP  BY 1, 2;

The LEFT [OUTER] JOIN was noise because the subsequent test WHERE meta_split.value IS NOT NULL forces an INNER JOIN anyway. So use CROSS JOIN instead.

Also, since jsonb does not allow duplicate keys on the same nesting level (meaning the same id can only pop up once per (key, value)), DISTINCT is just expensive noise. count(v.id) does the same, cheaper. And count(*) is equivalent and cheaper yet – assuming id is NOT NULL, as stated at the top.

count(*) has a separate implementation and is slightly faster than count(<value>). It is also subtly different from count(v.*): it counts all rows, no matter what, while the other forms skip NULL values.
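A minimal sketch to illustrate the difference (the one-column table t is hypothetical):

CREATE TEMP TABLE t (id int);
INSERT INTO t VALUES (1), (2), (NULL);

SELECT count(*)  AS all_rows      -- 3: counts every row, NULL or not
     , count(id) AS non_null_ids  -- 2: skips the row where id IS NULL
FROM   t;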

That is, as long as id cannot be NULL – as stated at the top. id should really be the PRIMARY KEY, which is implemented with a unique B-tree index internally anyway and marks its columns – just id here – NOT NULL implicitly. Failing that, the column should at least be declared NOT NULL. A UNIQUE INDEX does not fully qualify as a replacement: it still allows NULL values, which are not considered equal to each other and can appear multiple times.
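For example, the existing unique index can be promoted to a primary key constraint (a sketch; it assumes id currently contains no NULL values, and the constraint name is hypothetical):

ALTER TABLE voc_cc348779bdc84f8aab483f662a798a6a
  ADD CONSTRAINT voc_cc348779bdc84f8aab483f662a798a6a_pkey
  PRIMARY KEY USING INDEX voc_cc348779bdc84f8aab483f662a798a6a_id;
  -- reuses the existing unique index and marks id NOT NULL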

Apart from that, indexes are of no use here, as all rows have to be read anyway. So this is never going to be very cheap. But 62k rows is not a crippling row count by any means – unless you have huge numbers of keys in the jsonb column.

The remaining options to speed it up:

  1. Normalize your design. Unnesting JSON documents is not free of cost. (See the sketch after this list.)

  2. Maintain a materialized view. Feasibility and costs strongly depend on your write patterns. (Also sketched below.)
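For option 1, a minimal sketch (the table name voc_meta is hypothetical, and it assumes id is the PRIMARY KEY as recommended above):

CREATE TABLE voc_meta (
  voc_id integer REFERENCES voc_cc348779bdc84f8aab483f662a798a6a (id),
  key    text,
  value  text,
  PRIMARY KEY (voc_id, key)  -- safe: jsonb keys are unique per nesting level
);

INSERT INTO voc_meta (voc_id, key, value)
SELECT v.id, m.key, m.value
FROM   voc_cc348779bdc84f8aab483f662a798a6a v, jsonb_each_text(v.meta) AS m;

The aggregate then no longer has to unnest jsonb at query time:

SELECT key, value, count(*) FROM voc_meta GROUP BY 1, 2;

For option 2, a sketch along the same lines (the view name is hypothetical):

CREATE MATERIALIZED VIEW voc_meta_counts AS
SELECT meta_split.key, meta_split.value, count(*) AS ct
FROM   voc_cc348779bdc84f8aab483f662a798a6a v, jsonb_each(v.meta) AS meta_split
GROUP  BY 1, 2;

REFRESH MATERIALIZED VIEW voc_meta_counts;  -- after (batches of) writes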

… sometimes the query has a filter on the date column or the like.

That’s where indexes may play a role again …
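For instance, with a plain B-tree index on date, a filtered variant no longer has to read the whole table (the index name and date range are hypothetical):

CREATE INDEX voc_cc348779bdc84f8aab483f662a798a6a_date
  ON voc_cc348779bdc84f8aab483f662a798a6a (date);

SELECT meta_split.key, meta_split.value, count(*)
FROM   voc_cc348779bdc84f8aab483f662a798a6a v, jsonb_each(v.meta) AS meta_split
WHERE  v.date >= '2021-01-01'  -- a selective filter can use the index
GROUP  BY 1, 2;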
