
SELECT FOR UPDATE becomes slow with time

We have a table with 1B entries, and there are 4 processes working on it simultaneously. Each process claims rows by tagging them with its session id, 1,000 rows at a time, and then updates the table after processing 10,000 rows. The query used for claiming is:

EXPLAIN (ANALYZE, BUFFERS)
WITH b AS (
   SELECT userid, address
   FROM UserEvents
   WHERE deliveryId = 2108625
     AND (tsExpire > GetDate() OR tsExpire IS NULL)
     AND sendTime <= GetDate()
     AND session_id = 0
     AND level IN ('default')
   ORDER BY sendTime
   FOR UPDATE SKIP LOCKED
   LIMIT 1000
)
UPDATE UserEvents e
SET session_id = 1
FROM b
WHERE e.userid = b.userid
RETURNING b.userid, b.address

This query generally runs within 500 ms when all 4 processes are running simultaneously, but in the last few runs it has been slowing down significantly over time. Here are the explain plans:

"Update on UserEvents e  (cost=5753.03..8567.46 rows=1000 width=1962) (actual time=1373.284..1422.244 rows=1000 loops=1)"
"  Buffers: shared hit=1146920 read=59 dirtied=194"
"  I/O Timings: read=13.916"
"  CTE b"
"    ->  Limit  (cost=0.56..5752.46 rows=1000 width=82) (actual time=1373.094..1380.853 rows=1000 loops=1)"
"          Buffers: shared hit=1121721 read=27 dirtied=23"
"          I/O Timings: read=3.440"
"          ->  LockRows  (cost=0.56..179683.91 rows=31239 width=82) (actual time=1373.093..1380.775 rows=1000 loops=1)"
"                Buffers: shared hit=1121721 read=27 dirtied=23"
"                I/O Timings: read=3.440"
"                ->  Index Scan using UserEvents_nextpass2 on UserEvents  (cost=0.56..179371.52 rows=31239 width=82) (actual time=1366.046..1373.339 rows=4186 loops=1)"
"                      Index Cond: ((deliveryId = 2108625) AND (sendTime <= '2020-04-15 08:33:57.372282+00'::timestamp with time zone))"
"                      Filter: (((tsexpire > '2020-04-15 08:33:57.372282+00'::timestamp with time zone) OR (tsexpire IS NULL)) AND (session_id = 0) AND ((level)::text = 'default'::text))"
"                      Rows Removed by Filter: 29614"
"                      Buffers: shared hit=1113493 read=27"
"                      I/O Timings: read=3.440"
"  ->  Nested Loop  (cost=0.58..2815.00 rows=1000 width=1962) (actual time=1373.218..1389.995 rows=1000 loops=1)"
"        Buffers: shared hit=1126728 read=27 dirtied=23"
"        I/O Timings: read=3.440"
"        ->  CTE Scan on b  (cost=0.00..20.00 rows=1000 width=1692) (actual time=1373.106..1382.263 rows=1000 loops=1)"
"              Buffers: shared hit=1121721 read=27 dirtied=23"
"              I/O Timings: read=3.440"
"        ->  Index Scan using UserEvents_id on UserEvents e  (cost=0.58..2.79 rows=1 width=268) (actual time=0.007..0.007 rows=1 loops=1000)"
"              Index Cond: (userid = b.userid)"
"              Buffers: shared hit=5007"
"Planning Time: 0.331 ms"
"Execution Time: 1422.457 ms"

Surprisingly, the index scan on UserEvents_nextpass2 slows down significantly after this query has been called a few thousand times. This is a partial index on non-null sendTime values. sendTime is updated after each process updates the rows and removes its session id. But this has been the case for the last 1B events, so what could be the reason for this slowness? Any help would be appreciated.

Explain plan for a relatively faster run (~700 ms):

"Update on UserEvents e  (cost=5707.45..8521.87 rows=1000 width=1962) (actual time=695.897..751.557 rows=1000 loops=1)"
"  Buffers: shared hit=605921 read=68 dirtied=64"
"  I/O Timings: read=27.139"
"  CTE b"
"    ->  Limit  (cost=0.56..5706.87 rows=1000 width=82) (actual time=695.616..707.835 rows=1000 loops=1)"
"          Buffers: shared hit=580158 read=33 dirtied=29"
"          I/O Timings: read=10.491"
"          ->  LockRows  (cost=0.56..179686.41 rows=31489 width=82) (actual time=695.615..707.770 rows=1000 loops=1)"
"                Buffers: shared hit=580158 read=33 dirtied=29"
"                I/O Timings: read=10.491"
"                ->  Index Scan using UserEvents_nextpass2 on UserEvents  (cost=0.56..179371.52 rows=31489 width=82) (actual time=691.529..704.076 rows=3000 loops=1)"
"                      Index Cond: ((deliveryId = 2108625) AND (sendTime <= '2020-04-15 07:42:42.856859+00'::timestamp with time zone))"
"                      Filter: (((tsexpire > '2020-04-15 07:42:42.856859+00'::timestamp with time zone) OR (tsexpire IS NULL)) AND (session_id = 0) AND ((level)::text = 'default'::text))"
"                      Rows Removed by Filter: 29722"
"                      Buffers: shared hit=573158 read=33"
"                      I/O Timings: read=10.491"
"  ->  Nested Loop  (cost=0.58..2815.00 rows=1000 width=1962) (actual time=695.658..716.356 rows=1000 loops=1)"
"        Buffers: shared hit=585165 read=33 dirtied=29"
"        I/O Timings: read=10.491"
"        ->  CTE Scan on b  (cost=0.00..20.00 rows=1000 width=1692) (actual time=695.628..709.116 rows=1000 loops=1)"
"              Buffers: shared hit=580158 read=33 dirtied=29"
"              I/O Timings: read=10.491"
"        ->  Index Scan using UserEvents_id on UserEvents e  (cost=0.58..2.79 rows=1 width=268) (actual time=0.007..0.007 rows=1 loops=1000)"
"              Index Cond: (userid = b.userid)"
"              Buffers: shared hit=5007"
"Planning Time: 0.584 ms"
"Execution Time: 751.713 ms"

My index on this table is:

CREATE INDEX UserEvents_nextpass2 ON public.UserEvents USING btree (deliveryid ASC NULLS LAST, sendTime ASC NULLS LAST) WHERE sendTime IS NOT NULL;


Answer

Index Scan using UserEvents_nextpass2 on UserEvents  (cost=0.56..179371.52 rows=31239 width=82) (actual time=1366.046..1373.339 rows=4186 loops=1)
      Buffers: shared hit=1113493 read=27

It looks like there is a lot of obsolete data in the "UserEvents_nextpass2" index. Visiting roughly 266 buffer pages for every row returned (1,113,493 shared hits for 4,186 rows) is a bit ridiculous. Do you have any long-open transactions which are blocking VACUUM, and btree-specific microvacuum, from doing their job effectively?
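One way to get a rough picture of the bloat in that index is the pgstattuple extension, assuming you can install it; `pgstatindex` reports leaf-page density, so a very low `avg_leaf_density` would support the obsolete-data theory:

```sql
-- Requires superuser or appropriate privileges to create the extension.
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- A densely packed btree typically shows avg_leaf_density around 90;
-- a heavily bloated one can be far lower.
SELECT avg_leaf_density, leaf_fragmentation
FROM pgstatindex('userevents_nextpass2');
```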

Look in pg_stat_activity. Also, is hot_standby_feedback on? Is vacuum_defer_cleanup_age set to a nonzero value?
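A quick way to run those checks, sketched against the standard pg_stat_activity view (the 5-minute threshold is illustrative, pick whatever counts as "long" for your workload):

```sql
-- Sessions whose transactions have been open long enough to hold back
-- the xmin horizon and prevent dead index entries from being cleaned up.
SELECT pid, state, xact_start, query
FROM pg_stat_activity
WHERE xact_start < now() - interval '5 minutes'
ORDER BY xact_start;

-- The two replication-related settings that can also delay cleanup.
SHOW hot_standby_feedback;
SHOW vacuum_defer_cleanup_age;
```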
