I’ve built a system for archiving, aggregating, and indexing my IM history across different platforms (e.g. MSN Messenger, Hangouts, WhatsApp). The data is currently stored in an SQLite3 database, and relevant parts of the schema look something like this:
CREATE TABLE messages ( message_id TEXT PRIMARY KEY, conversation_id TEXT, sender_id TEXT, sent_micros BIGINT, -- microseconds since epoch received_micros BIGINT, read_micros BIGINT, message_plain TEXT, -- various other columns )
There are also NOT NULL
constraints on all columns, but I skipped those to make the schema a little easier to read.
What I’d like to do is implement “search with context”, where records around (before/after) the result are shown. Here’s an example, with fake syntax and a newline added to make it easier to distinguish the result groups:
RETURNING CONTEXT 3 ROWS AROUND PARTITION BY conversation_id ORDER BY sent_micros ASC SELECT message_id, conversation_id, sender_id, sent_micros, message_plain FROM messages WHERE message_plain LIKE '%Waterdeep%' ORDER BY sent_micros ASC, conversation_id ASC;
message_id | conversation_id | sender_id | sent_s | message_plain -----------+-----------------+------------+-------------+----------------------------------------- xkezwE9bMz | Mv3YdQsXZw | vBPkl8G8e4 | 1614762003… | what did you want to play later? PKhydJ3TCM | Mv3YdQsXZw | i3QsU2T2ms | 1614762008… | I dunno umfTYF2Qlv | Mv3YdQsXZw | i3QsU2T2ms | 1614762009… | well actually RJy6MVAQYv | Mv3YdQsXZw | i3QsU2T2ms | 1614762011… | what about Lords of Waterdeep? k2V49Ls1Jj | Mv3YdQsXZw | vBPkl8G8e4 | 1614762314… | sounds good PpUSRDVC9O | Mv3YdQsXZw | vBPkl8G8e4 | 1614762315… | maybe I'll even let you win this time! HNaiAd0zY6 | Mv3YdQsXZw | i3QsU2T2ms | 1614762322… | hah, like that would ever happen TUR5ZvEBMB | P2nclp3jaG | vBPkl8G8e4 | 1614726660… | I'm heading to bed au72U5GaQQ | P2nclp3jaG | vBPkl8G8e4 | 1614726661… | goodnight HFGXbn1X0S | P2nclp3jaG | KcDHZ9zM4c | 1614727012… | 'night E6u30kU7Fo | P2nclp3jaG | vBPkl8G8e4 | 1614762840… | did Sam get the Waterdeep expansion? ssTPVZO1ce | P2nclp3jaG | KcDHZ9zM4c | 1614762921… | yeah WEA69uNJ7a | P2nclp3jaG | KcDHZ9zM4c | 1614762926… | we're hoping to play it soon wktsSF3L3l | P2nclp3jaG | KcDHZ9zM4c | 1614762934… | like maybe this weekend
Note that timestamps are only ordered within a group of records; “I’m heading to bed” was sent before “what did you want to play later?” but the “Waterdeep” rows are ordered correctly.
The only way I can think of to do this is with 2n + 1 queries:
SELECT message_id, conversation_id, sent_micros FROM messages WHERE message_plain LIKE '%Waterdeep%' ORDER BY sent_micros ASC
- For each row in that result:
SELECT * FROM messages WHERE conversation_id = :conversation_id AND sent_micros >= :sent_micros ORDER BY sent_micros ASC LIMIT 4
SELECT * FROM messages WHERE conversation_id = :conversation_id AND sent_micros <= :sent_micros ORDER BY sent_micros DESC LIMIT 4
but this seems very wasteful and inefficient.
Is there a better way to handle this? Could window functions be useful? Would a different database engine help?
It seems likely that at some point I’ll add full-text search capability rather than simple LIKE
-based matching. I imagine this will basically just give the message_id
results for a query, so it feels like some kind of “return n rows around primary key values x, y, z” solution would be ideal. The LIKE
-based query can be decomposed into that form, and would only be two queries rather than 2n + 1
.
Advertisement
Answer
You can do it with ROW_NUMBER()
window function:
WITH cte AS ( SELECT *, message_plain LIKE '%Waterdeep%' flag, ROW_NUMBER() OVER (ORDER BY conversation_id, sent_micros) rn FROM messages ) SELECT c1.message_id, c1.conversation_id, c1.sent_micros FROM cte c1 INNER JOIN cte c2 ON c2.conversation_id = c1.conversation_id AND c1.rn BETWEEN c2.rn - 4 AND c2.rn + 4 WHERE c2.flag