I have huge pageview level table, where each session has multiple pageviews.
Session -> Pageviews – 1 : M
From the pageviews table, I need to consolidate the data into one row per session with meta data like FirstPageview, LastPageview, #ofvisitstoLogin
Firstpageview is defined as the very first pageview for a session (this can be login)
Lastpageview is defined as a pageview that is not /login (this cannot be login, if it is pick the pageview before this)
#ofvisitstologin is defined as the #oftimesthesession landed on login
visitorid | sessionid | timestamp | webpagepath |
---|---|---|---|
abc | 123 | 2021-01-01 10:30:00 | homepage |
abc | 123 | 2021-01-01 10:30:50 | login |
abc | 123 | 2021-01-01 10:31:00 | page2 |
abc | 123 | 2021-01-01 10:32:50 | page3 |
abc | 123 | 2021-01-01 10:33:00 | page2 |
abc | 123 | 2021-01-01 10:35:50 | login |
The output that I want,
visitorid | sessionid | FristPageview | LastPageview | #ofvisitstoLogin |
---|---|---|---|---|
abc | 123 | homepage | page2 | 2 |
Is it possible to do this with window functions? Without using subqueries / joins because this is a huge table and it affects performance.
Sample query that I have,
select visitor_id, session_id, first_value(webpage_path) over (partition by visitor_id, session_id order by timestamp asc) as first_pageview, last_value(webpage_path) over (partition by visitor_id, session_id order by timestamp asc) as last_page_view, sum(iff(webpage_path = 'login', 1, 0)) over (partition by visitor_id, session_id timestamp asc) as noofvisitstologin from pageviews;
But this code doesn’t work because for the last_page_view column will be “login” if i run this code. But i need it to be “page2” instead.
Is this achievable without using subqueries / self join?
Advertisement
Answer
LAST_VALUE supports IGNORE NULLS
clause and it could be combined with NULLIF(webpage_path, 'login')
:
select visitor_id, session_id, first_value(webpage_path) over (partition by visitor_id, session_id order by timestamp asc) as first_pageview, last_value(NULLIF(webpage_path,'login')) ignore nulls over (partition by visitor_id, session_id order by timestamp asc) as last_page_view, count_if(webpage_path = 'login') over (partition by visitor_id, session_id timestamp asc) as noofvisitstologin from pageviews;
Second: SUM(IIF(...))
could be replaced with COUNT_IF().