I got a search results log table with search terms in one of the columns. Since the search results are produced as the user types in the search terms, there are multiple rows for each search term with the partial string. For example, as the user types world
the resulting rows in the table will be:
ts | userid | string_term | number of results 1000 | u22 | wo | 35 1002 | u22 | wor | 14 1003 | u22 | worl | 18 1005 | u22 | world | 8 1100 | u22 | he | 21 1102 | u22 | hell | 11 1103 | u22 | hello | 6
Expected results:
ts | userid | string_term | number of results 1005 | u22 | world | 8 1103 | u22 | hello | 6
I’m using Hive and wanted to know if there is a way to compare the substrings across multiple rows by same user within a time limit?
Thanks!
Advertisement
Answer
You can treat this as a gaps-and-islands problem. Look back and see if the word does not match and then accumulate and aggregate:
with words as ( select t.*, sum(case when prev_word is null or word like prev_word || '%' or prev_ts < ts - 60 -- say within a minute then 1 else 0 end) as grp from (select t.*, lag(word) over (partition by userid order by ts) as prev_word, lag(ts) over (partition by userid order by ts) as prev_ts from t ) t ) select w.* from (select w.*, row_number() over (partition by userid, grp order by ts desc) as seqnum from words w ) w where seqnum = 1;
Actually, there is an easier way. Just look forward:
select w.* from (select t.*, lag(word) over (partition by userid order by ts) as next_word, lead(ts) over (partition by userid order by ts) as next_ts from t ) t where next_ts is null or next_ts > ts + 60 or next_word not like word || '%';