Skip to content
Advertisement

Hive – Merge rows with search term substrings

I got a search results log table with search terms in one of the columns. Since the search results are produced as the user types in the search terms, there are multiple rows for each search term with the partial string. For example, as the user types world the resulting rows in the table will be:

ts    | userid      | string_term | number of results
1000  | u22         |    wo       | 35
1002  | u22         |    wor      | 14
1003  | u22         |    worl     | 18
1005  | u22         |    world    | 8
1100  | u22         |    he       | 21
1102  | u22         |    hell     | 11
1103  | u22         |    hello    | 6

Expected results:

ts    | userid      | string_term | number of results
1005  | u22         |    world    | 8
1103  | u22         |    hello    | 6

I’m using Hive and wanted to know if there is a way to compare the substrings across multiple rows by same user within a time limit?

Thanks!

Advertisement

Answer

You can treat this as a gaps-and-islands problem. Look back and see if the word does not match and then accumulate and aggregate:

with words as (
      select t.*,
             sum(case when prev_word is null or
                           word like prev_word || '%' or
                           prev_ts < ts - 60   -- say within a minute
                      then 1 else 0
                 end) as grp
      from (select t.*, 
                   lag(word) over (partition by userid order by ts) as prev_word,
                   lag(ts) over (partition by userid order by ts) as prev_ts
            from t
           ) t
     )
select w.*
from (select w.*,
             row_number() over (partition by userid, grp order by ts desc) as seqnum
      from words w
     ) w
where seqnum = 1;

Actually, there is an easier way. Just look forward:

select w.*
from (select t.*, 
             lag(word) over (partition by userid order by ts) as next_word,
              lead(ts) over (partition by userid order by ts) as next_ts
       from t
      ) t
where next_ts is null or
      next_ts > ts + 60 or
      next_word not like word || '%';
User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement