Skip to content
Advertisement

Getting values from the first row, last row, and an aggregation in MySQL Window function

For a marketing related analysis I need to provide data on the first- and last-touchpoint and of the number of total interactions with our website.

A simplified version of our interaction table looks like this:

create table interaction (
    id varchar(36) primary key,
    session_id varchar(36) not null,
    timestamp timestamp(3) not null,
    utm_source varchar(255) null,
    utm_medium varchar(255) null
)

Our current approach looks like this:

with interaction_ordered as (
    select *, 
           row_number() over (partition by session_id order by timestamp asc) as row_num_asc,
           row_number() over (partition by session_id order by timestamp desc) as row_num_desc
    from interaction
)

select first_interaction.session_id as session_id,
       first_interaction.timestamp as session_start,
       timestampdiff(SECOND, first_interaction.timestamp, last_interaction.timestamp) as session_duration,
       count(*) as interaction_count,
       first_interaction.utm_source as first_touchpoint,
       last_interaction.utm_source as last_touchpoint,
       last_interaction.utm_medium as last_medium
from interaction_ordered as interaction
join interaction_ordered as first_interaction using (session_id)
join interaction_ordered as last_interaction using (session_id)
where first_interaction.row_num_asc = 1 and last_interaction.row_num_desc = 1
group by session_id
having session_start between ? - interval 1 day and ? + interval 1 day

Currently, we observe that the runtime scales approximately linear with our data, which will become infeasible to compute soon.

An alternative idea is

select session_id,
       min(timestamp) as session_start,
       timestampdiff(
           SECOND,
           min(timestamp),
           max(timestamp)
       ) as session_duration,
       count(*) as interaction_count,
       first_value(utm_source) over (partition by session_id order by timestamp) as first_touchpoint,
       first_value(utm_source) over (partition by session_id order by timestamp desc) as last_touchpoint,
       first_value(utm_medium) over (partition by session_id order by timestamp desc) as last_medium
from interaction
group by session_id
having session_start between ? - interval 1 day and ? + interval 1 day

but in our experiments we never saw the second query complete. Hence, we are not a 100% sure that it yields the same results.

We tried indices on timestamp and (session_id, timestamp), but according to EXPLAIN this didn’t change the query plan.

Is there any fast way to retrieve individual properties from the first and last entry per session_id plus the count per session_id? Note that in our real example there are more parameter like utm_source and utm_medium that we are interested in.

EDIT

Sample data:

insert into interaction values
('a', 'session_1', '2020-06-15T12:00:00.000', 'search.com', 'search'),
('b', 'session_1', '2020-06-15T12:01:00.000', null, null),
('c', 'session_1', '2020-06-15T12:01:30.000', 'social.com', 'social'),
('d', 'session_1', '2020-06-15T12:02:00.250', 'ads.com', 'ads'),

('e', 'session_2', '2020-06-15T14:00:00.000', null, null),
('f', 'session_2', '2020-06-15T14:12:00.000', null, null),
('g', 'session_2', '2020-06-15T14:25:00.000', 'social.com', 'social'),

('h', 'session_3', '2020-06-16T12:05:00.000', 'ads.com', 'ads'),
('i', 'session_3', '2020-06-16T12:05:01.000', null, null),

('j', 'session_4', '2020-06-15T12:00:00.000', null, null),
('k', 'session_5', '2020-06-15T12:00:00.000', 'search.com', 'search');

Expected result:

session_id, session_start, session_duration, interaction_count, first_touchpoint, last_touchpoint, last_medium
session_1, 2020-06-15T12:00:00.000, 120, 4, search.com, ads.com, ads
session_2, 2020-06-15T14:00:00.000, 1500, 3, null, social.com, social
session_3, 2020-06-16T12:05:00.000, 1, 2, ads.com, null, null
session_4, 2020-06-15T12:00:00.000, 0, 1, null, null, null
session_5, 2020-06-15T12:00:00.000, 0, 1, search.com, search.com, search

I noticed that my second query doesn’t yield the expected result. The last_touchpoint and last_medium are filled with the first value instead. I tried

  • first_value(utm_source) over (partition by session_id order by timestamp desc) as last_touchpoint, and
  • last_value(utm_source) over (partition by session_id order by timestamp range between unbounded preceding and unbounded following) as last_touchpoint,

Advertisement

Answer

The only way you are going to make the query scalable is by reducing the amount of data being processed using a where clause. If I assume that sessions never last more than a day, then I can expand the timeframe for the calculation by a day and use window functions. That results in something like this:

select s.*
from (select i.*,
             min(timestamp) over (partition by session_id) as session_start,
             count(*) over (partition by session_id) as interaction_count,
             first_value(utm_source) over (partition by session_id order by timestamp) as first_touchpoint,
             first_value(utm_source) over (partition by session_id order by timestamp desc) as last_touchpoint,
             first_value(utm_medium) over (partition by session_id order by timestamp desc) as last_medium
      from interaction i
      where timestamp between ? - interval 2 day and ? + interval 2 day
     ) s
where timestamp = session_start and
      session_start between ? - interval 1 day and ? + interval 1 day;

Your use of first_value() should be returning an error — it violates the rules of “full group by” which MySQL 8+ has set by default. No surprise that syntactically incorrect code is not working.

User contributions licensed under: CC BY-SA
9 People found this is helpful
Advertisement