I have a mobile app browsing history dataset as shown below.
- DeviceDateTime: Date and Time the User views the page in the mobile app.
- UserID: each UserID represents a visitor who login the mobile app.
- PageName: There are different pages in the Mobile App. All visitors would first land on the Home page, and then navigate to different pages.
- PageSequence: the order of the pages visited. For example, Seq_1 Home > Seq_2 My Account = first landing on Home page, then navigate to “My Account” page.
DeviceDateTime | UserID | PageName | PageSequence |
---|---|---|---|
2021-01-19 16:40:00.000 | UserA | Home | Seq_1 |
2021-01-19 16:40:00.000 | UserA | My Account | Seq_2 |
2021-01-19 16:40:07.000 | UserA | My Activity | Seq_3 |
2021-01-19 16:40:07.000 | UserA | Restaurant Listing | Seq_4 |
2021-01-19 16:40:18.000 | UserA | Restaurant Details Page | Seq_5 |
2021-01-19 16:40:31.000 | UserA | Restaurant Details Page | Seq_6 |
2021-01-19 16:40:31.000 | UserA | Restaurant Booking Confirmation | Seq_7 |
2021-01-19 16:40:40.000 | UserA | Home | Seq_8 |
2021-01-19 16:40:45.000 | UserA | Write To Us | Seq_9 |
2021-01-19 16:40:46.000 | UserA | Home | Seq_10 |
2021-01-28 21:11:53.000 | UserB | Home | Seq_1 |
2021-01-28 21:12:01.000 | UserB | Restaurant Listing | Seq_2 |
2021-01-28 21:13:37.000 | UserB | Restaurant Listing | Seq_3 |
2021-02-16 09:43:27.000 | UserA | Home | Seq_1 |
2021-02-16 09:43:43.000 | UserA | Write To Us | Seq_2 |
2021-02-16 09:44:50.000 | UserA | My Account | Seq_3 |
2021-02-16 09:45:03.000 | UserA | My Activity | Seq_4 |
I need to do the following pre-processing in SQL:
- Summarize the dataset into a table as shown below. I would like to show in the table the “FROM” (source page) and “TO” (Target page). For example, 1 visitor has been through the journey FROM “Home” page to “My Account” page; 2 visitors have been through the journey FROM “Home” page to “Write To Us”.
- When the journey is FROM and TO the same page, do not count the journey. For example, UserA Seq_5 and Seq_6, “FROM Restaurant Details Page” TO “Restaurant Details Page” should not be included.
- Journey across two different users should not be included. For example, FROM “Restaurant Listing” (UserB Seq_3) TO “Home” (UserA Seq_1) should not be included.
Result Table:
FROM | TO | No_of_Users |
---|---|---|
Home | My Account | 1 |
My Account | My Activity | 2 |
My Activity | Restaurant Listing | 1 |
Restaurant Listing | Restaurant Details Page | 1 |
Restaurant Details Page | Restaurant Booking Confirmation | 1 |
Restaurant Booking Confirmation | Home | 1 |
Home | Write To Us | 2 |
Write To Us | Home | 1 |
Home | Restaurant Listing | 1 |
Home | Restaurant Listing | 1 |
Write To Us | My Account | 1 |
I have around 600,000 users and a total of 21 unique PageName in the dataset.
I have tried the following script but it didn’t work. I failed to capture all the possible journeys in the summary table. For example, FROM “Write To Us” TO “My Account”, FROM “Restaurant Booking Confirmation” TO “Home” are all missing in the result.
DROP TABLE IF EXISTS #App CREATE TABLE #App ( DeviceDateTime SMALLDATETIME, UserID VARCHAR(100), PageName VARCHAR(100), PageSequence VARCHAR(100)) INSERT INTO #App VALUES ('2021-01-19 16:40:00.000','UserA', 'Home', 'Seq_1'), ('2021-01-19 16:40:00.000','UserA', 'My Account', 'Seq_2'), ('2021-01-19 16:40:07.000','UserA', 'My Activity', 'Seq_3'), ('2021-01-19 16:40:07.000','UserA', 'Restaurant Listing', 'Seq_4'), ('2021-01-19 16:40:18.000','UserA', 'Restaurant Details Page', 'Seq_5'), ('2021-01-19 16:40:31.000','UserA', 'Restaurant Details Page', 'Seq_6'), ('2021-01-19 16:40:31.000','UserA', 'Restaurant Booking Confirmation', 'Seq_7'), ('2021-01-19 16:40:40.000','UserA', 'Home', 'Seq_8'), ('2021-01-19 16:40:45.000','UserA', 'Write To Use', 'Seq_9'), ('2021-01-19 16:40:46.000','UserA', 'Home', 'Seq_10'), ('2021-01-28 21:11:53.000','UserB', 'Home', 'Seq_1'), ('2021-01-28 21:12:01.000','UserB', 'Restaurant Listing', 'Seq_2'), ('2021-01-28 21:13:37.000','UserB', 'Restaurant Listing', 'Seq_3'), ('2021-02-16 09:43:27.000','UserA', 'Home', 'Seq_1'), ('2021-02-16 09:43:43.000','UserA', 'Write To Us', 'Seq_2'), ('2021-02-16 09:44:50.000','UserA', 'My Account', 'Seq_3'), ('2021-02-16 09:45:03.000','UserA', 'My Activity', 'Seq_4'); DROP TABLE IF EXISTS #SD with seq_fixed as ( select UserID, DeviceDateTime, PageName, cast(right(PageSequence, charindex('_', reverse(PageSequence)) - 1) as int) as pagesequencefinal from #App ) , with_next as ( select UserID, DeviceDateTime, PageName, lead(PageName) over (partition by UserID, DeviceDateTime order by UserID, DeviceDateTime ASC) as next_pagename from seq_fixed group by UserID, DeviceDateTime, PageName ) select PageName, next_pagename, count(*) AS No_of_User into #SD from with_next where next_pagename is not null group by PageName, next_pagename order by PageName, next_pagename; SELECT * FROM #SD
Advertisement
Answer
In your previous request you wanted to only acknowledge a page’s first ocurrence per user, so Home->PageX->Home->PageY would be interpreted as Home->PageX->PageY. For this to happen you had to group by user and page to find this first occurrence.
This is not the case in this new request, so don’t aggregate:
with seq_fixed as ( select userid, pagename, cast(right(pagesequence, charindex('_', reverse(pagesequence)) - 1) as int) as pagesequencefinal from app ) , with_next as ( select userid, pagename, lead(pagename) over (partition by userid order by pagesequencefinal) as next_pagename from seq_fixed ) select pagename, next_pagename, count(*) from with_next where next_pagename is not null group by pagename, next_pagename order by pagename, next_pagename;
The only problem with this: When a user ended on PageX yesterday and starts with Home today, this will count as PageX->Home. If you want to prevent this from happening you need some detection for this situation, e.g. don’t consider an entry a page change when it’s pedecessor is at least 1 hour old or the like. For this you can use your timestamp column and LAG
.