I have a user’s table that has info about user_id, places user purchased tickets, and the time user had purchased the ticket.
Users:
|------------|-------------|----------------------| | user_id | place | purchase_time | |------------|-------------|----------------------| | 1 | New York | 2021-11-27:17:00:21 | | 1 | Chicago | 2021-11-25:19:00:21 | | 1 | Chicago | 2021-11-23:03:00:21 | | 1 | Washington | 2021-11-21:07:00:21 | | 1 | Washington | 2021-11-19:12:00:21 | | 1 | Washington | 2021-11-17:00:00:21 | | 1 | Washington | 2021-11-15:23:00:21 | | 1 | Washington | 2021-11-12:21:00:21 | | 2 | Chicago | 2021-09-25:01:00:21 | | 2 | Milwaukee | 2021-09-24:02:00:21 | | 2 | Milwaukee | 2021-09-23:03:00:21 | | 2 | New York | 2021-09-22:19:00:21 | | 2 | Chicago | 2021-09-21:01:00:21 | | 3 | Milwaukee | 2021-10-27:12:31:21 | | 3 | Washington | 2021-10-24:07:01:23 | | 3 | Chicago | 2021-10-21:01:78:89 | |------------|-------------|----------------------|
I want to add a new column that shows the user’s most visited place at the time of purchasing ticket. The table would like (Snowflake):
|------------|-------------|----------------------|---------------------| | user_id | place | purchase_time | most_visited_place | |------------|-------------|----------------------|---------------------| | 1 | New York | 2021-11-27:17:00:21 | Washington | <--- Washington, because at purchase_time This place was most visited by the user | 1 | Chicago | 2021-11-25:19:00:21 | Washington | <--- Washington, because at purchase_time This place was most visited by the user | 1 | Chicago | 2021-11-23:03:00:21 | Washington | <--- Washington, because at purchase_time This place was most visited by the user | 1 | Washington | 2021-11-21:07:00:21 | Washington | <--- Washington, because at purchase_time This place was most visited by the user | 1 | Washington | 2021-11-19:12:00:21 | Washington | <--- Washington, because at purchase_time This place was most visited by the user | 1 | Washington | 2021-11-17:00:00:21 | Washington | <--- Washington, because at purchase_time This place was most visited by the user | 1 | Washington | 2021-11-15:23:00:21 | Washington | <--- Washington, because at purchase_time This place was most visited by the user | 1 | Washington | 2021-11-12:21:00:21 | Washington | <--- Washington, because at purchase_time This place was most visited by the user | 2 | Chicago | 2021-09-21:01:00:25 | Chicago | <-- tie, break. Both Chicago and Milwaukee were most visited then take the recent most visited | 2 | Milwaukee | 2021-09-21:02:00:24 | Milwaukee | <--- Milwaukee, because at purchase_time This place was most visited by the user | 2 | Milwaukee | 2021-09-21:03:00:23 | Milwaukee | <--- Milwaukee, because at purchase_time This place was most visited by the user | 2 | New York | 2021-09-21:19:00:22 | New York | <-- tie, break. Both Chicago and New York were most visited then take the recent most visited | 2 | Chicago | 2021-09-21:01:00:21 | Chicago | <--- Chicago, because at purchase_time This place was most visited by the user | 3 | Milwaukee | 2021-10-27:12:31:21 | Milwaukee | | 3 | Washington | 2021-10-24:07:01:23 | Washington | | 3 | Chicago | 2021-10-21:01:78:89 | Chicago | |------------|-------------|----------------------|---------------------|
Advertisement
Answer
You want to use a WINDOW version of COUNT to get the “prior rows count” and then join to all the prior counted rows, and filter out the “best” via a QUALIFY
WITH prior_user AS (
    SELECT 
        user_id,
        place,
        purchase_time,
        COUNT(place) OVER (PARTITION BY user_id, place ORDER BY purchase_time ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS place_count
    FROM users
)
SELECT 
    u.user_id,
    u.place,
    u.purchase_time,
    p.place AS most_visited_place
FROM users u
JOIN prior_user p
    ON u.user_id = p.user_id AND u.purchase_time >= p.purchase_time
QUALIFY row_number() OVER (partition by u.user_id, u.purchase_time ORDER BY place_count DESC, p.purchase_time DESC) = 1
*this sql has not been run.