I am trying to optimize a query in Postgres, without success.
Here is my table:
CREATE TABLE IF NOT EXISTS voc_cc348779bdc84f8aab483f662a798a6a (
    id      SERIAL,
    date    TIMESTAMP,
    text    TEXT,
    themes  JSONB,
    meta    JSONB,
    canal   VARCHAR(255),
    source  VARCHAR(255),
    file    VARCHAR(255)
);
I have indexes on the id and meta columns:
CREATE UNIQUE INDEX voc_cc348779bdc84f8aab483f662a798a6a_id
    ON voc_cc348779bdc84f8aab483f662a798a6a USING btree (id);

CREATE INDEX voc_cc348779bdc84f8aab483f662a798a6a_meta
    ON voc_cc348779bdc84f8aab483f662a798a6a USING btree (meta);
There are 62k rows in this table. The query I’m trying to optimize is this one:
SELECT meta_split.key, meta_split.value, COUNT(DISTINCT(id))
FROM voc_cc348779bdc84f8aab483f662a798a6a
LEFT JOIN LATERAL jsonb_each(voc_cc348779bdc84f8aab483f662a798a6a.meta) AS meta_split ON TRUE
WHERE meta_split.value IS NOT NULL
GROUP BY meta_split.key, meta_split.value;
In this query, meta is a JSON object like this one:

{
    "Age": "50 to 59 yo",
    "Kids": "No kid",
    "Gender": "Male"
}
I want to get the full list of key/value pairs plus the count of rows for each of them. Here is the result of EXPLAIN ANALYZE VERBOSE for my query:
GroupAggregate  (cost=1138526.13..1201099.13 rows=100 width=72) (actual time=2016.984..2753.058 rows=568 loops=1)
  Output: meta_split.key, meta_split.value, count(DISTINCT voc_cc348779bdc84f8aab483f662a798a6a.id)
  Group Key: meta_split.key, meta_split.value
  ->  Sort  (cost=1138526.13..1154169.13 rows=6257200 width=68) (actual time=2015.501..2471.027 rows=563148 loops=1)
        Output: meta_split.key, meta_split.value, voc_cc348779bdc84f8aab483f662a798a6a.id
        Sort Key: meta_split.key, meta_split.value
        Sort Method: external merge  Disk: 26672kB
        ->  Nested Loop  (cost=0.00..131538.72 rows=6257200 width=68) (actual time=0.029..435.456 rows=563148 loops=1)
              Output: meta_split.key, meta_split.value, voc_cc348779bdc84f8aab483f662a798a6a.id
              ->  Seq Scan on public.voc_cc348779bdc84f8aab483f662a798a6a  (cost=0.00..6394.72 rows=62572 width=294) (actual time=0.007..16.588 rows=62572 loops=1)
                    Output: voc_cc348779bdc84f8aab483f662a798a6a.id, voc_cc348779bdc84f8aab483f662a798a6a.date, voc_cc348779bdc84f8aab483f662a798a6a.text, voc_cc348779bdc84f8aab483f662a798a6a.themes, voc_cc348779bdc84f8aab483f662a798a6a.meta, voc_cc348779bdc84f8aab483f662a798a6a.canal, voc_cc348779bdc84f8aab483f662a798a6a.source, voc_cc348779bdc84f8aab483f662a798a6a.file
              ->  Function Scan on pg_catalog.jsonb_each meta_split  (cost=0.00..1.00 rows=100 width=64) (actual time=0.005..0.005 rows=9 loops=62572)
                    Output: meta_split.key, meta_split.value
                    Function Call: jsonb_each(voc_cc348779bdc84f8aab483f662a798a6a.meta)
                    Filter: (meta_split.value IS NOT NULL)
Planning Time: 1.502 ms
Execution Time: 2763.309 ms
I tried changing COUNT(DISTINCT(id)) to COUNT(DISTINCT voc_cc348779bdc84f8aab483f662a798a6a.*), and I tried using subqueries; those were respectively 10x and 30x slower. I also thought about maintaining a separate table with those counts, but I can’t do that because I need to filter the results (for example, sometimes the query has a filter on the date column or the like).
I don’t really know how to optimize it further, and it’s already quite slow for such a small row count. I expect ten times this number of rows later, and that would be too slow if the execution time scales with the row count, as it did with the first 62k.
Answer
Assuming id is not only UNIQUE – as enforced by your UNIQUE INDEX – but also NOT NULL. (That’s missing in your table definition.)
SELECT meta_split.key, meta_split.value, count(*)
FROM   voc_cc348779bdc84f8aab483f662a798a6a v
CROSS  JOIN LATERAL jsonb_each(v.meta) AS meta_split
GROUP  BY meta_split.key, meta_split.value;
Shorter equivalent:
SELECT meta_split.key, meta_split.value, count(*)
FROM   voc_cc348779bdc84f8aab483f662a798a6a v, jsonb_each(v.meta) AS meta_split
GROUP  BY 1, 2;
The LEFT [OUTER] JOIN was noise because the subsequent test WHERE meta_split.value IS NOT NULL forces an INNER JOIN anyway. Using CROSS JOIN instead.
Also, since jsonb does not allow duplicate keys on the same level anyway (meaning the same id can only pop up once per (key, value)), DISTINCT is just expensive noise. count(v.id) does the same, cheaper. And count(*) is equivalent and cheaper, yet – assuming id is NOT NULL, as stated at the top.
count(*) has a separate implementation and is slightly faster than count(<value>). It’s subtly different from count(v.*): it counts all rows, no matter what, while the other form does not count NULL values.
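A minimal demonstration of the difference, using a throwaway table (the names here are made up for illustration):

-- count(*) counts every row; count(x) skips rows where x IS NULL.
CREATE TEMP TABLE demo (x int);
INSERT INTO demo VALUES (1), (NULL), (3);

SELECT count(*) AS all_rows,     -- returns 3
       count(x) AS non_null_x    -- returns 2
FROM   demo;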
That is, as long as id cannot be NULL – as stated at the top. id should really be the PRIMARY KEY, which is implemented with a unique B-tree index internally anyway and makes all contained columns – just id here – NOT NULL implicitly. Or at least declare the column NOT NULL. A UNIQUE INDEX does not fully qualify as a replacement: it still allows NULL values, which are not considered equal to each other and can appear multiple times.
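A sketch of one way to do that for the table above; promoting id to the PRIMARY KEY makes your hand-made unique index redundant:

-- Promote id to PRIMARY KEY: enforces UNIQUE and NOT NULL in one step.
ALTER TABLE voc_cc348779bdc84f8aab483f662a798a6a
    ADD PRIMARY KEY (id);

-- The separate unique index is now redundant and can be dropped.
DROP INDEX voc_cc348779bdc84f8aab483f662a798a6a_id;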
Apart from that, indexes are of no use here, as all rows have to be read anyway. So this is never going to be very cheap. But 62k rows is not a crippling row count by any means – unless you have huge numbers of keys in the jsonb column.
The remaining options to speed it up:
Normalize your design. Unnesting JSON documents is not free of cost.
Maintain a materialized view with precomputed counts. Feasibility and costs strongly depend on your write patterns. See the sketch below.
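A minimal sketch of the second option, assuming the aggregate above is the query you run most often (the view name voc_meta_counts is made up; note that a precomputed view of this shape does not cover ad-hoc filters such as the date one, which is part of why feasibility depends on your usage):

-- Hypothetical materialized view precomputing the per-(key, value) counts.
CREATE MATERIALIZED VIEW voc_meta_counts AS
SELECT meta_split.key, meta_split.value, count(*) AS row_count
FROM   voc_cc348779bdc84f8aab483f662a798a6a v
CROSS  JOIN LATERAL jsonb_each(v.meta) AS meta_split
GROUP  BY meta_split.key, meta_split.value;

-- Refresh after writes to the base table; the cost of this is what
-- makes the approach depend on your write patterns.
REFRESH MATERIALIZED VIEW voc_meta_counts;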
… sometimes the query has a filter on the date column or the like.
That’s where indexes may play a role again …
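For instance, a plain B-tree index on date (the index name is made up) could let Postgres narrow the scan for date-filtered variants of the query, provided the filter is selective enough:

-- B-tree index to support date-range filters on this table.
CREATE INDEX voc_cc348779bdc84f8aab483f662a798a6a_date
    ON voc_cc348779bdc84f8aab483f662a798a6a (date);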