Skip to content
Advertisement

BigQuery: need a clever solution for a difficult query

The context of this problem is a Live Chat service. Each conversation consists of multiple messages of which the timestamp is registred. Conversations are held in one or more channels, depending on the difficulty of the inquiry.

The picture below shows an example of the data. For testing purposes the sheet can be found here.

enter image description here

In the picture you see the initial difficulty is determined at row 5 (based on the content of the messages). Rows 2 till 4 should have the same difficulty, as these messages form a group. After that it gets easier, as all messages between two difficulties belong to the one earlier in time (basically messages are of difficulty X until difficulty Y is introduced). The colors in the picture show which difficulty should belong to which messages.

So far I have succeeded in writing a query which can assign the correct difficulty to rows 5 till 29. Using the same logic and a subquery, I could probably also find a way to assign the correct difficulty to messages before the first channel ID, but I was wondering of any of you see a cleaner, more intuitive way to solve this.

WITH channelMessage AS (
SELECT conversation, channel, timestamp
FROM MessagesTable
WHERE channel IS NOT NULL
AND channel != '' 
),

messages AS (
SELECT a.message, b.channel, a.conversation, a.timestamp,
CASE WHEN b.timestamp IS NULL THEN 0 ELSE TIMESTAMP_DIFF(b.timestamp, a.timestamp, MILLISECOND) END AS diff,
CASE WHEN b.timestamp IS NULL THEN 0 ELSE MAX(TIMESTAMP_DIFF(b.timestamp, a.timestamp, MILLISECOND)) OVER (PARTITION BY a.message ORDER BY a.timestamp) END AS diff2
FROM MessagesTable AS a
LEFT JOIN (SELECT * FROM channelMessage) AS b
ON a.conversation = b.conversation AND a.timestamp >= b.timestamp
)

SELECT timestamp, conversation, message, channel, diff, diff2,
FROM messages 
WHERE diff = diff2
ORDER BY timestamp

Advertisement

Answer

The channel for each message that doesn’t already have one is the channel of the most recent earlier message in the same conversation that does have a channel. To handle rows 2-4 we can say that if there is no earlier message with a channel, we’ll take the earliest message in the conversation that does have a channel.

I’ve renamed your table to “Messages” for clarity.

SELECT a.timestamp, a.conversation, 
  COALESCE(
    -- Message already has a channel
    a.channel,
    -- Channel from most recent earlier message
    (SELECT MAX(c.channel) FROM Messages c
         WHERE c.conversation = a.conversation
         AND c.timestamp =
           (SELECT MAX(c2.timestamp) FROM Messages c2
                WHERE c2.conversation = a.conversation
                AND c2.channel IS NOT NULL
                AND c2.timestamp < a.timestamp)),
    -- Channel of earliest message
    (SELECT MAX(c.channel) FROM Messages c
         WHERE c.conversation = a.conversation
         AND c.timestamp =
           (SELECT MIN(c2.timestamp) FROM Messages c2
                WHERE c2.conversation = a.conversation
                AND c2.channel IS NOT NULL))) AS channel
FROM Messages a;

There’s another solution that involves assigning a number to each row of the original table, then using a recursive CTE to find the previous non-NULL value. You’d need to figure out how to handle rows 2-4 in this case. I think my solution is a more straightforward implementation of the channel selection logic you described.

User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement