Skip to content
Advertisement

PostgresSQL: Get first and last values of an entities chain (self referential table)

I have a table where entities are events. Each event belongs to a group. Each time a new event for this group is inserted, it refers last event of same group on the attibute field “prior” (self referential FK). The other attributes are the id of the event itself, the group which the event belongs to and if this event were or not deleted (the deleted ones must be desconsidered)

For example:

id grp prior deleted
1 1 NULL false
2 2 NULL false
3 3 NULL false
4 1 1 false
5 3 3 false
6 2 2 false
7 2 6 false
8 2 7 false
9 3 5 false
10 3 9 false
11 2 8 true
12 2 8 false
13 1 4 false
14 2 12 false
15 1 13 false

In this case, the events sequence for each group are:

Group Sequence
Group 1 1 → 4 → 13 → 15
Group 2 2 → 6 → 7 → 8 → 12 → 14
Group 3 3 → 5 → 9 → 10

I want to find the first and last event for each group.

group first_event last_event
1 1 15
2 2 14
3 3 10

For this I’m trying to run this query

WITH ev_priorposterior AS (SELECT grp, id, prior, (SELECT ev2.id FROM events ev2 WHERE ev2.prior = ev.id AND deleted = false) posterior
FROM events ev
WHERE deleted = FALSE
ORDER BY ev.id)

SELECT events.grp, 
(SELECT id FROM ev_priorposterior epp WHERE prior IS NULL AND events.grp = epp.grp) first_event, 
(SELECT id FROM ev_priorposterior epp WHERE posterior IS NULL AND events.grp = epp.grp) last_event
FROM events
GROUP BY grp

The fisrt query, at WITH statement, is to generate a view with prior and posterior event for each entry and the second to get, for each group, the first and last events.

The query worked, but in my database, with about 7500 records of events and 4900 records of groups, it’s taking about 40 seconds to run.

I would like to understand why this query is taking so long and how I can improve their performance.

I know that I can workaround by using ID number, which is squential for events (Get maximum and minimum ID for each group). Or even a timestamp field that is also presents at the original database (Get maximum and minimum timestamp for each group), but I really desire a more general solution, based on the chain anlysis.

Exemple of my solution (low performance): dbfiddle.uk

Advertisement

Answer

You are walking the graph forward and backward for each node. That can be taxing for the engine.

You can try:

with recursive
n as (
  select id, grp, prior, 1 as place from events where prior is null
 union all
  select t.id, t.grp, t.prior, n.place + 1
  from n
  join events t on t.prior = n.id and not t.deleted
)
select grp, 
  max(case when f then id end) as first_event, 
  max(case when not f then id end) as last_event
from (
  (select distinct on (grp) true as f, * from n order by grp, place)
 union all
  (select distinct on (grp) false, * from n order by grp, place desc)
) x
group by grp

Example at db<>fiddle. As you can see this solution has a cost of 1436 compared to your query that has a cost of 43770.

The query above only walks the graph forward and makes single passes on the subqueries; it should be decently fast.

User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement