I have a table with five relevant fields – id
, source
, iid
, track_hash
, alias
. I want to group all entries into groups with a common track_hash
and then for each row save the id
of the row with the lowest source
(with ties broken in favor of the highest iid
) entry from its group into the alias
field. To do this I wrote the following query:
with best as
(SELECT id as bid, track_hash FROM
(SELECT id, track_hash,
RANK () OVER (
PARTITION BY track_hash
ORDER BY source asc, iid DESC
) rank
from albums
)
where rank = 1
)
select bid, a.* from albums a inner join best
on a.track_hash = best.track_hash
This takes a completely reasonable 2 seconds on 24k rows. Now, instead of simply seeing this id
, I want to actually save it. For this, I used the following very similar query:
with best as
(SELECT id as bid, track_hash FROM
(SELECT id, track_hash,
RANK () OVER (
PARTITION BY track_hash
ORDER BY source asc, iid DESC
) rank
from albums
)
where rank = 1
)
update albums
set alias = bid FROM albums a inner join best
on a.track_hash = best.track_hash
However, this one takes anywhere between 1 and 10 minutes, and I really don’t understand why. Doesn’t the engine have to match every row to its best.id
/alias
anyway, which is exactly what I’m doing with my update? Why is this happening and what am I doing wrong?
Query plan looks like this:
MATERIALIZE 1
CO-ROUTINE 4
SCAN TABLE albums USING INDEX track_hash_idx
USE TEMP B-TREE FOR RIGHT PART OF ORDER BY
SCAN SUBQUERY 4
SCAN TABLE albums USING COVERING INDEX track_hash_idx
SEARCH SUBQUERY 1 USING AUTOMATIC PARTIAL COVERING INDEX (rank=?)
SEARCH TABLE albums AS a USING COVERING INDEX track_hash_idx (track_hash=?)
Advertisement
Answer
You don’t need the join to albums
(again).
The UPDATE ... FROM
syntax provides actually an implicit join of albums
to best
:
UPDATE albums AS a
SET alias = b.bid
FROM best AS b
WHERE a.track_hash = b.track_hash