I wrote a function to create posts for a simple blogging engine:
```sql
CREATE FUNCTION CreatePost(VARCHAR, TEXT, VARCHAR[]) RETURNS INTEGER AS $$
DECLARE
    InsertedPostId INTEGER;
    TagName        VARCHAR;
BEGIN
    INSERT INTO Posts (Title, Body)
    VALUES ($1, $2)
    RETURNING Id INTO InsertedPostId;

    FOREACH TagName IN ARRAY $3 LOOP
        DECLARE
            InsertedTagId INTEGER;
        BEGIN
            -- I am concerned about this part.
            BEGIN
                INSERT INTO Tags (Name)
                VALUES (TagName)
                RETURNING Id INTO InsertedTagId;
            EXCEPTION WHEN UNIQUE_VIOLATION THEN
                SELECT INTO InsertedTagId Id
                FROM Tags
                WHERE Name = TagName
                FETCH FIRST ROW ONLY;
            END;

            INSERT INTO Taggings (PostId, TagId)
            VALUES (InsertedPostId, InsertedTagId);
        END;
    END LOOP;

    RETURN InsertedPostId;
END;
$$ LANGUAGE 'plpgsql';
```
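(For reference, a minimal sketch of the schema this function assumes; names are taken from the function body, the exact definitions may differ.)

```sql
-- Simplified sketch of the assumed schema (inferred from the function above).
CREATE TABLE Posts (
    Id    serial  PRIMARY KEY
  , Title varchar NOT NULL
  , Body  text
);

CREATE TABLE Tags (
    Id   serial  PRIMARY KEY
  , Name varchar UNIQUE            -- the constraint that raises UNIQUE_VIOLATION
);

CREATE TABLE Taggings (
    PostId int REFERENCES Posts(Id)
  , TagId  int REFERENCES Tags(Id)
  , PRIMARY KEY (PostId, TagId)
);
```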
Is this prone to race conditions when multiple users delete tags and create posts at the same time?
Specifically, do transactions (and thus functions) prevent such race conditions from happening?
I’m using PostgreSQL 9.2.3.
Answer
It’s the recurring problem of SELECT or INSERT under possible concurrent write load, related to (but different from) UPSERT (which is INSERT or UPDATE).
Postgres 9.5 or newer
This PL/pgSQL function uses the UPSERT syntax (INSERT ... ON CONFLICT ... DO NOTHING) to INSERT or SELECT a single row:
```sql
CREATE OR REPLACE FUNCTION f_tag_id(_tag text, OUT _tag_id int)
  LANGUAGE plpgsql AS
$func$
BEGIN
   SELECT tag_id              -- only if row existed before
   FROM   tag
   WHERE  tag = _tag
   INTO   _tag_id;

   IF NOT FOUND THEN
      INSERT INTO tag AS t (tag)
      VALUES (_tag)
      ON     CONFLICT (tag) DO NOTHING
      RETURNING t.tag_id
      INTO   _tag_id;
   END IF;
END
$func$;
```
There is still a tiny window for a race condition. To make absolutely sure we get an ID:
```sql
CREATE OR REPLACE FUNCTION f_tag_id(_tag text, OUT _tag_id int)
  LANGUAGE plpgsql AS
$func$
BEGIN
LOOP
   SELECT tag_id
   FROM   tag
   WHERE  tag = _tag
   INTO   _tag_id;

   EXIT WHEN FOUND;

   INSERT INTO tag AS t (tag)
   VALUES (_tag)
   ON     CONFLICT (tag) DO NOTHING
   RETURNING t.tag_id
   INTO   _tag_id;

   EXIT WHEN FOUND;
END LOOP;
END
$func$;
```
This keeps looping until either the INSERT or the SELECT succeeds.
Call:
```sql
SELECT f_tag_id('possibly_new_tag');
```
If subsequent commands in the same transaction rely on the existence of the row, and it is actually possible that other transactions update or delete it concurrently, you can lock an existing row in the SELECT statement with FOR SHARE.
If the row gets inserted instead, it is locked (or not visible for other transactions) until the end of the transaction anyway.
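For illustration, a sketch of the looping function above with that lock added (f_tag_id_locked is a hypothetical name, used here only to leave f_tag_id untouched):

```sql
CREATE OR REPLACE FUNCTION f_tag_id_locked(_tag text, OUT _tag_id int)
  LANGUAGE plpgsql AS
$func$
BEGIN
LOOP
   SELECT tag_id
   INTO   _tag_id
   FROM   tag
   WHERE  tag = _tag
   FOR    SHARE;              -- lock the found row against concurrent UPDATE / DELETE

   EXIT WHEN FOUND;

   INSERT INTO tag AS t (tag)
   VALUES (_tag)
   ON     CONFLICT (tag) DO NOTHING
   RETURNING t.tag_id         -- a newly inserted row is locked until commit anyway
   INTO   _tag_id;

   EXIT WHEN FOUND;
END LOOP;
END
$func$;
```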
Start with whichever of INSERT or SELECT is the common case for your workload, to make it faster.
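For example, if new tags are the norm rather than the exception, a hypothetical variant might try the INSERT first (otherwise identical to the looping function above):

```sql
-- Hypothetical variant: same as f_tag_id() above, but with the INSERT tried first.
CREATE OR REPLACE FUNCTION f_tag_id_ins_first(_tag text, OUT _tag_id int)
  LANGUAGE plpgsql AS
$func$
BEGIN
LOOP
   INSERT INTO tag AS t (tag)
   VALUES (_tag)
   ON     CONFLICT (tag) DO NOTHING
   RETURNING t.tag_id
   INTO   _tag_id;

   EXIT WHEN FOUND;

   SELECT tag_id
   INTO   _tag_id
   FROM   tag
   WHERE  tag = _tag;

   EXIT WHEN FOUND;
END LOOP;
END
$func$;
```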
Related:
- Get Id from a conditional INSERT
- How to include excluded rows in RETURNING from INSERT … ON CONFLICT
There is also a related (pure SQL) solution to INSERT or SELECT multiple rows (a set) at once.
What’s wrong with this pure SQL solution?
```sql
CREATE OR REPLACE FUNCTION f_tag_id(_tag text, OUT _tag_id int)
  LANGUAGE sql AS
$func$
WITH ins AS (
   INSERT INTO tag AS t (tag)
   VALUES (_tag)
   ON     CONFLICT (tag) DO NOTHING
   RETURNING t.tag_id
   )
SELECT tag_id FROM ins
UNION  ALL
SELECT tag_id FROM tag WHERE tag = _tag
LIMIT  1;
$func$;
```
Not entirely wrong, but it fails to seal a loophole, as @FunctorSalad worked out: the function can come up with an empty result if a concurrent transaction tries to do the same at the same time. The manual:
All the statements are executed with the same snapshot
If a concurrent transaction inserts the same new tag a moment earlier, but hasn’t committed yet:
- The UPSERT part comes up empty, after waiting for the concurrent transaction to finish. (If the concurrent transaction should roll back, it still inserts the new tag and returns a new ID.)
- The SELECT part also comes up empty, because it’s based on the same snapshot, where the new tag from the (yet uncommitted) concurrent transaction is not visible.
We get nothing. Not as intended. That’s counter-intuitive to naive logic (and I got caught there), but that’s how the MVCC model of Postgres works – has to work.
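For illustration, one way to reproduce the loophole by hand in two psql sessions (a sketch; 'new_tag' is an arbitrary example value):

```sql
-- Session A: insert the tag via the pure SQL function, but do not commit yet.
BEGIN;
SELECT f_tag_id('new_tag');   -- inserts the row and returns its new tag_id

-- Session B: call the same function while A is still open.
SELECT f_tag_id('new_tag');   -- blocks on the conflicting (uncommitted) insert ...

-- Session A:
COMMIT;

-- Session B now returns NULL: its INSERT hit ON CONFLICT DO NOTHING, and its
-- SELECT uses the same snapshot, which predates A's commit.
```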
So do not use this if multiple transactions can try to insert the same tag at the same time. Or loop until you actually get a row. The loop will hardly ever be triggered in common workloads anyway.
Postgres 9.4 or older
Given this (slightly simplified) table:
```sql
CREATE TABLE tag (
   tag_id serial PRIMARY KEY
 , tag    text   UNIQUE
);
```
An almost 100% secure function to insert a new tag / select an existing one could look like this:
```sql
CREATE OR REPLACE FUNCTION f_tag_id(_tag text, OUT tag_id int)
  LANGUAGE plpgsql AS
$func$
BEGIN
LOOP
   BEGIN
      WITH sel AS (
         SELECT t.tag_id
         FROM   tag t
         WHERE  t.tag = _tag
         FOR    SHARE
         )
      , ins AS (
         INSERT INTO tag(tag)
         SELECT _tag
         WHERE  NOT EXISTS (SELECT 1 FROM sel)   -- only if not found
         RETURNING tag.tag_id                    -- qualified so no conflict with param
         )
      SELECT sel.tag_id FROM sel
      UNION  ALL
      SELECT ins.tag_id FROM ins
      INTO   tag_id;

   EXCEPTION WHEN UNIQUE_VIOLATION THEN          -- insert in concurrent session?
      RAISE NOTICE 'It actually happened!';      -- hardly ever happens
   END;

   EXIT WHEN tag_id IS NOT NULL;                 -- else keep looping
END LOOP;
END
$func$;
```
Why not 100%? Consider the notes in the manual for the related UPSERT example.
Explanation
- Try the SELECT first. This way you avoid the considerably more expensive exception handling 99.99% of the time.
- Use a CTE to minimize the (already tiny) time slot for the race condition.
- The time window between the SELECT and the INSERT within one query is super tiny. If you don’t have heavy concurrent load, or if you can live with an exception once a year, you could just ignore the case and use the SQL statement, which is faster.
- No need for FETCH FIRST ROW ONLY (= LIMIT 1). The tag name is obviously UNIQUE.
- Remove FOR SHARE in my example if you don’t usually have concurrent DELETE or UPDATE on the table tag. Costs a tiny bit of performance.
- Never quote the language name: 'plpgsql'. plpgsql is an identifier. Quoting may cause problems and is only tolerated for backwards compatibility.
- Don’t use non-descriptive column names like id or name. When joining a couple of tables (which is what you do in a relational DB) you end up with multiple identical names and have to use aliases.
Built into your function
Using this function you could largely simplify your FOREACH LOOP to:
```sql
...
FOREACH TagName IN ARRAY $3 LOOP
   INSERT INTO taggings (PostId, TagId)
   VALUES (InsertedPostId, f_tag_id(TagName));
END LOOP;
...
```
Faster, though, as a single SQL statement with unnest():
```sql
INSERT INTO taggings (PostId, TagId)
SELECT InsertedPostId, f_tag_id(tag)
FROM   unnest($3) tag;
```
Replaces the whole loop.
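Putting it together, the CreatePost() function from the question might shrink to something like the following sketch. It keeps the question’s Posts / Taggings names and assumes f_tag_id() is defined against whichever tag table you settled on; the parameter names are added for readability only:

```sql
-- Sketch: CreatePost() rewritten around f_tag_id() and unnest().
CREATE OR REPLACE FUNCTION CreatePost(_title varchar, _body text, _tags varchar[])
  RETURNS integer
  LANGUAGE plpgsql AS
$func$
DECLARE
   _post_id integer;
BEGIN
   INSERT INTO Posts (Title, Body)
   VALUES (_title, _body)
   RETURNING Id
   INTO   _post_id;

   -- one set-based statement instead of the FOREACH loop
   INSERT INTO Taggings (PostId, TagId)
   SELECT _post_id, f_tag_id(tag)
   FROM   unnest(_tags) tag;

   RETURN _post_id;
END
$func$;
```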
Alternative solution
This variant builds on the behavior of UNION ALL with a LIMIT clause: as soon as enough rows are found, the rest is never executed.
Building on this, we can outsource the INSERT into a separate function. Only there do we need exception handling. It is just as safe as the first solution.
```sql
CREATE OR REPLACE FUNCTION f_insert_tag(_tag text, OUT tag_id int)
  RETURNS int
  LANGUAGE plpgsql AS
$func$
BEGIN
   INSERT INTO tag(tag)
   VALUES (_tag)
   RETURNING tag.tag_id
   INTO   tag_id;

EXCEPTION WHEN UNIQUE_VIOLATION THEN
   -- catch exception, NULL is returned
END
$func$;
```
It is then used in the main function:
```sql
CREATE OR REPLACE FUNCTION f_tag_id(_tag text, OUT _tag_id int)
  LANGUAGE plpgsql AS
$func$
BEGIN
LOOP
   SELECT tag_id FROM tag WHERE tag = _tag
   UNION  ALL
   SELECT f_insert_tag(_tag)       -- only executed if tag not found
   LIMIT  1                        -- not strictly necessary, just to be clear
   INTO   _tag_id;

   EXIT WHEN _tag_id IS NOT NULL;  -- else keep looping
END LOOP;
END
$func$;
```
- This is a bit cheaper if most of the calls only need SELECT, because the more expensive block with INSERT containing the EXCEPTION clause is rarely entered. The query is also simpler.
- FOR SHARE is not possible here (not allowed in UNION query).
- LIMIT 1 would not be necessary (tested in pg 9.4). Postgres derives LIMIT 1 from INTO _tag_id and only executes until the first row is found.