I have a table with five relevant fields – id
, source
, iid
, track_hash
, alias
. I want to group all entries into groups with a common track_hash
and then for each row save the id
of the row with the lowest source
(with ties broken in favor of the highest iid
) entry from its group into the alias
field. To do this I wrote the following query:
with best as (SELECT id as bid, track_hash FROM (SELECT id, track_hash, RANK () OVER ( PARTITION BY track_hash ORDER BY source asc, iid DESC ) rank from albums ) where rank = 1 ) select bid, a.* from albums a inner join best on a.track_hash = best.track_hash
This takes a completely reasonable 2 seconds on 24k rows. Now, instead of simply seeing this id
, I want to actually save it. For this, I used the following very similar query:
with best as (SELECT id as bid, track_hash FROM (SELECT id, track_hash, RANK () OVER ( PARTITION BY track_hash ORDER BY source asc, iid DESC ) rank from albums ) where rank = 1 ) update albums set alias = bid FROM albums a inner join best on a.track_hash = best.track_hash
However, this one takes anywhere between 1 and 10 minutes, and I really don’t understand why. Doesn’t the engine have to match every row to its best.id
/alias
anyway, which is exactly what I’m doing with my update? Why is this happening and what am I doing wrong?
Query plan looks like this:
MATERIALIZE 1 CO-ROUTINE 4 SCAN TABLE albums USING INDEX track_hash_idx USE TEMP B-TREE FOR RIGHT PART OF ORDER BY SCAN SUBQUERY 4 SCAN TABLE albums USING COVERING INDEX track_hash_idx SEARCH SUBQUERY 1 USING AUTOMATIC PARTIAL COVERING INDEX (rank=?) SEARCH TABLE albums AS a USING COVERING INDEX track_hash_idx (track_hash=?)
Advertisement
Answer
You don’t need the join to albums
(again).
The UPDATE ... FROM
syntax provides actually an implicit join of albums
to best
:
UPDATE albums AS a SET alias = b.bid FROM best AS b WHERE a.track_hash = b.track_hash