Skip to content
Advertisement

Database search results with context

I’ve built a system for archiving, aggregating, and indexing my IM history across different platforms (e.g. MSN Messenger, Hangouts, WhatsApp). The data is currently stored in an SQLite3 database, and relevant parts of the schema look something like this:

CREATE TABLE messages (
  message_id TEXT PRIMARY KEY,
  conversation_id TEXT,
  sender_id TEXT,
  sent_micros BIGINT,  -- microseconds since epoch
  received_micros BIGINT,
  read_micros BIGINT,
  message_plain TEXT,
  -- various other columns
)

There are also NOT NULL constraints on all columns, but I skipped those to make the schema a little easier to read.

What I’d like to do is implement “search with context”, where records around (before/after) the result are shown. Here’s an example, with fake syntax and a newline added to make it easier to distinguish the result groups:

RETURNING CONTEXT 3 ROWS AROUND PARTITION BY conversation_id ORDER BY sent_micros ASC
SELECT message_id, conversation_id, sender_id, sent_micros, message_plain FROM messages
WHERE message_plain LIKE '%Waterdeep%'
ORDER BY sent_micros ASC, conversation_id ASC;
message_id | conversation_id | sender_id  | sent_s      | message_plain
-----------+-----------------+------------+-------------+-----------------------------------------
xkezwE9bMz | Mv3YdQsXZw      | vBPkl8G8e4 | 1614762003… | what did you want to play later?
PKhydJ3TCM | Mv3YdQsXZw      | i3QsU2T2ms | 1614762008… | I dunno
umfTYF2Qlv | Mv3YdQsXZw      | i3QsU2T2ms | 1614762009… | well actually
RJy6MVAQYv | Mv3YdQsXZw      | i3QsU2T2ms | 1614762011… | what about Lords of Waterdeep?
k2V49Ls1Jj | Mv3YdQsXZw      | vBPkl8G8e4 | 1614762314… | sounds good
PpUSRDVC9O | Mv3YdQsXZw      | vBPkl8G8e4 | 1614762315… | maybe I'll even let you win this time!
HNaiAd0zY6 | Mv3YdQsXZw      | i3QsU2T2ms | 1614762322… | hah, like that would ever happen

TUR5ZvEBMB | P2nclp3jaG      | vBPkl8G8e4 | 1614726660… | I'm heading to bed
au72U5GaQQ | P2nclp3jaG      | vBPkl8G8e4 | 1614726661… | goodnight
HFGXbn1X0S | P2nclp3jaG      | KcDHZ9zM4c | 1614727012… | 'night
E6u30kU7Fo | P2nclp3jaG      | vBPkl8G8e4 | 1614762840… | did Sam get the Waterdeep expansion?
ssTPVZO1ce | P2nclp3jaG      | KcDHZ9zM4c | 1614762921… | yeah
WEA69uNJ7a | P2nclp3jaG      | KcDHZ9zM4c | 1614762926… | we're hoping to play it soon
wktsSF3L3l | P2nclp3jaG      | KcDHZ9zM4c | 1614762934… | like maybe this weekend

Note that timestamps are only ordered within a group of records; “I’m heading to bed” was sent before “what did you want to play later?” but the “Waterdeep” rows are ordered correctly.


The only way I can think of to do this is with 2n + 1 queries:

  1. SELECT message_id, conversation_id, sent_micros FROM messages WHERE message_plain LIKE '%Waterdeep%' ORDER BY sent_micros ASC
  2. For each row in that result:
    • SELECT * FROM messages WHERE conversation_id = :conversation_id AND sent_micros >= :sent_micros ORDER BY sent_micros ASC LIMIT 4
    • SELECT * FROM messages WHERE conversation_id = :conversation_id AND sent_micros <= :sent_micros ORDER BY sent_micros DESC LIMIT 4

but this seems very wasteful and inefficient.

Is there a better way to handle this? Could window functions be useful? Would a different database engine help?


It seems likely that at some point I’ll add full-text search capability rather than simple LIKE-based matching. I imagine this will basically just give the message_id results for a query, so it feels like some kind of “return n rows around primary key values x, y, z” solution would be ideal. The LIKE-based query can be decomposed into that form, and would only be two queries rather than 2n + 1.

Advertisement

Answer

You can do it with ROW_NUMBER() window function:

WITH cte AS (
  SELECT *, 
         message_plain LIKE '%Waterdeep%' flag,
         ROW_NUMBER() OVER (ORDER BY conversation_id, sent_micros) rn
  FROM messages 
)
SELECT c1.message_id, c1.conversation_id, c1.sent_micros
FROM cte c1 INNER JOIN cte c2
ON c2.conversation_id = c1.conversation_id AND c1.rn BETWEEN c2.rn - 4 AND c2.rn + 4
WHERE c2.flag
User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement