Skip to content
Advertisement

SQL: Use LEAD() and PARTITION BY to access to the next row following the current row

I have a mobile app browsing history dataset as shown below.

  • DeviceDateTime: Date and Time the User views the page in the mobile app.
  • UserID: each UserID represents a visitor who login the mobile app.
  • PageName: There are different pages in the Mobile App. All visitors would first land on the Home page, and then navigate to different pages.
  • PageSequence: the order of the pages visited. For example, Seq_1 Home > Seq_2 My Account = first landing on Home page, then navigate to “My Account” page.
DeviceDateTime UserID PageName PageSequence
2021-01-19 16:40:00.000 UserA Home Seq_1
2021-01-19 16:40:00.000 UserA My Account Seq_2
2021-01-19 16:40:07.000 UserA My Activity Seq_3
2021-01-19 16:40:07.000 UserA Restaurant Listing Seq_4
2021-01-19 16:40:18.000 UserA Restaurant Details Page Seq_5
2021-01-19 16:40:31.000 UserA Restaurant Details Page Seq_6
2021-01-19 16:40:31.000 UserA Restaurant Booking Confirmation Seq_7
2021-01-19 16:40:40.000 UserA Home Seq_8
2021-01-19 16:40:45.000 UserA Write To Us Seq_9
2021-01-19 16:40:46.000 UserA Home Seq_10
2021-01-28 21:11:53.000 UserB Home Seq_1
2021-01-28 21:12:01.000 UserB Restaurant Listing Seq_2
2021-01-28 21:13:37.000 UserB Restaurant Listing Seq_3
2021-02-16 09:43:27.000 UserA Home Seq_1
2021-02-16 09:43:43.000 UserA Write To Us Seq_2
2021-02-16 09:44:50.000 UserA My Account Seq_3
2021-02-16 09:45:03.000 UserA My Activity Seq_4

I need to do the following pre-processing in SQL:

  1. Summarize the dataset into a table as shown below. I would like to show in the table the “FROM” (source page) and “TO” (Target page). For example, 1 visitor has been through the journey FROM “Home” page to “My Account” page; 2 visitors have been through the journey FROM “Home” page to “Write To Us”.
  2. When the journey is FROM and TO the same page, do not count the journey. For example, UserA Seq_5 and Seq_6, “FROM Restaurant Details Page” TO “Restaurant Details Page” should not be included.
  3. Journey across two different users should not be included. For example, FROM “Restaurant Listing” (UserB Seq_3) TO “Home” (UserA Seq_1) should not be included.

Result Table:

FROM TO No_of_Users
Home My Account 1
My Account My Activity 2
My Activity Restaurant Listing 1
Restaurant Listing Restaurant Details Page 1
Restaurant Details Page Restaurant Booking Confirmation 1
Restaurant Booking Confirmation Home 1
Home Write To Us 2
Write To Us Home 1
Home Restaurant Listing 1
Home Restaurant Listing 1
Write To Us My Account 1

I have around 600,000 users and a total of 21 unique PageName in the dataset.

I have tried the following script but it didn’t work. I failed to capture all the possible journeys in the summary table. For example, FROM “Write To Us” TO “My Account”, FROM “Restaurant Booking Confirmation” TO “Home” are all missing in the result.

DROP TABLE IF EXISTS #App
CREATE TABLE #App (
    DeviceDateTime SMALLDATETIME,
    UserID VARCHAR(100),
    PageName VARCHAR(100),
    PageSequence VARCHAR(100))
INSERT INTO #App VALUES
    ('2021-01-19 16:40:00.000','UserA', 'Home', 'Seq_1'),
    ('2021-01-19 16:40:00.000','UserA', 'My Account', 'Seq_2'),
    ('2021-01-19 16:40:07.000','UserA', 'My Activity', 'Seq_3'),
    ('2021-01-19 16:40:07.000','UserA', 'Restaurant Listing', 'Seq_4'),
    ('2021-01-19 16:40:18.000','UserA', 'Restaurant Details Page', 'Seq_5'),
    ('2021-01-19 16:40:31.000','UserA', 'Restaurant Details Page', 'Seq_6'),
    ('2021-01-19 16:40:31.000','UserA', 'Restaurant Booking Confirmation', 'Seq_7'),
    ('2021-01-19 16:40:40.000','UserA', 'Home', 'Seq_8'),
    ('2021-01-19 16:40:45.000','UserA', 'Write To Use', 'Seq_9'),
    ('2021-01-19 16:40:46.000','UserA', 'Home', 'Seq_10'),
    ('2021-01-28 21:11:53.000','UserB', 'Home', 'Seq_1'),
    ('2021-01-28 21:12:01.000','UserB', 'Restaurant Listing', 'Seq_2'),
    ('2021-01-28 21:13:37.000','UserB', 'Restaurant Listing', 'Seq_3'),
    ('2021-02-16 09:43:27.000','UserA', 'Home', 'Seq_1'),
    ('2021-02-16 09:43:43.000','UserA', 'Write To Us', 'Seq_2'),
    ('2021-02-16 09:44:50.000','UserA', 'My Account', 'Seq_3'),
    ('2021-02-16 09:45:03.000','UserA', 'My Activity', 'Seq_4');

DROP TABLE IF EXISTS #SD
with seq_fixed as
(
  select
    UserID,
    DeviceDateTime,
    PageName,
    cast(right(PageSequence, charindex('_', reverse(PageSequence)) - 1) as int) as pagesequencefinal
  from #App
)
, with_next as
(
  select
    UserID,
    DeviceDateTime,
    PageName,
    lead(PageName) over (partition by UserID, DeviceDateTime order by UserID, DeviceDateTime ASC) as next_pagename
  from seq_fixed
  group by UserID, DeviceDateTime, PageName
)
select PageName, next_pagename, count(*) AS No_of_User
into #SD
from with_next
where next_pagename is not null
group by PageName, next_pagename
order by PageName, next_pagename;

SELECT * FROM #SD

Advertisement

Answer

In your previous request you wanted to only acknowledge a page’s first ocurrence per user, so Home->PageX->Home->PageY would be interpreted as Home->PageX->PageY. For this to happen you had to group by user and page to find this first occurrence.

This is not the case in this new request, so don’t aggregate:

with seq_fixed as
(
  select
    userid,
    pagename,
    cast(right(pagesequence, charindex('_', reverse(pagesequence)) - 1) as int) as pagesequencefinal
  from app
)
, with_next as
(
  select
    userid,
    pagename,
    lead(pagename) over (partition by userid order by pagesequencefinal) as next_pagename
  from seq_fixed
)
select pagename, next_pagename, count(*)
from with_next
where next_pagename is not null
group by pagename, next_pagename
order by pagename, next_pagename;

The only problem with this: When a user ended on PageX yesterday and starts with Home today, this will count as PageX->Home. If you want to prevent this from happening you need some detection for this situation, e.g. don’t consider an entry a page change when it’s pedecessor is at least 1 hour old or the like. For this you can use your timestamp column and LAG.

User contributions licensed under: CC BY-SA
7 People found this is helpful
Advertisement