For a marketing related analysis I need to provide data on the first- and last-touchpoint and of the number of total interactions with our website.
A simplified version of our interaction
table looks like this:
create table interaction ( id varchar(36) primary key, session_id varchar(36) not null, timestamp timestamp(3) not null, utm_source varchar(255) null, utm_medium varchar(255) null )
Our current approach looks like this:
with interaction_ordered as ( select *, row_number() over (partition by session_id order by timestamp asc) as row_num_asc, row_number() over (partition by session_id order by timestamp desc) as row_num_desc from interaction ) select first_interaction.session_id as session_id, first_interaction.timestamp as session_start, timestampdiff(SECOND, first_interaction.timestamp, last_interaction.timestamp) as session_duration, count(*) as interaction_count, first_interaction.utm_source as first_touchpoint, last_interaction.utm_source as last_touchpoint, last_interaction.utm_medium as last_medium from interaction_ordered as interaction join interaction_ordered as first_interaction using (session_id) join interaction_ordered as last_interaction using (session_id) where first_interaction.row_num_asc = 1 and last_interaction.row_num_desc = 1 group by session_id having session_start between ? - interval 1 day and ? + interval 1 day
Currently, we observe that the runtime scales approximately linear with our data, which will become infeasible to compute soon.
An alternative idea is
select session_id, min(timestamp) as session_start, timestampdiff( SECOND, min(timestamp), max(timestamp) ) as session_duration, count(*) as interaction_count, first_value(utm_source) over (partition by session_id order by timestamp) as first_touchpoint, first_value(utm_source) over (partition by session_id order by timestamp desc) as last_touchpoint, first_value(utm_medium) over (partition by session_id order by timestamp desc) as last_medium from interaction group by session_id having session_start between ? - interval 1 day and ? + interval 1 day
but in our experiments we never saw the second query complete. Hence, we are not a 100% sure that it yields the same results.
We tried indices on timestamp
and (session_id, timestamp)
, but according to EXPLAIN
this didn’t change the query plan.
Is there any fast way to retrieve individual properties from the first and last entry per session_id plus the count per session_id?
Note that in our real example there are more parameter like utm_source
and utm_medium
that we are interested in.
EDIT
Sample data:
insert into interaction values ('a', 'session_1', '2020-06-15T12:00:00.000', 'search.com', 'search'), ('b', 'session_1', '2020-06-15T12:01:00.000', null, null), ('c', 'session_1', '2020-06-15T12:01:30.000', 'social.com', 'social'), ('d', 'session_1', '2020-06-15T12:02:00.250', 'ads.com', 'ads'), ('e', 'session_2', '2020-06-15T14:00:00.000', null, null), ('f', 'session_2', '2020-06-15T14:12:00.000', null, null), ('g', 'session_2', '2020-06-15T14:25:00.000', 'social.com', 'social'), ('h', 'session_3', '2020-06-16T12:05:00.000', 'ads.com', 'ads'), ('i', 'session_3', '2020-06-16T12:05:01.000', null, null), ('j', 'session_4', '2020-06-15T12:00:00.000', null, null), ('k', 'session_5', '2020-06-15T12:00:00.000', 'search.com', 'search');
Expected result:
session_id, session_start, session_duration, interaction_count, first_touchpoint, last_touchpoint, last_medium session_1, 2020-06-15T12:00:00.000, 120, 4, search.com, ads.com, ads session_2, 2020-06-15T14:00:00.000, 1500, 3, null, social.com, social session_3, 2020-06-16T12:05:00.000, 1, 2, ads.com, null, null session_4, 2020-06-15T12:00:00.000, 0, 1, null, null, null session_5, 2020-06-15T12:00:00.000, 0, 1, search.com, search.com, search
I noticed that my second query doesn’t yield the expected result. The last_touchpoint
and last_medium
are filled with the first value instead.
I tried
first_value(utm_source) over (partition by session_id order by timestamp desc) as last_touchpoint,
andlast_value(utm_source) over (partition by session_id order by timestamp range between unbounded preceding and unbounded following) as last_touchpoint,
Advertisement
Answer
The only way you are going to make the query scalable is by reducing the amount of data being processed using a where
clause. If I assume that sessions never last more than a day, then I can expand the timeframe for the calculation by a day and use window functions. That results in something like this:
select s.* from (select i.*, min(timestamp) over (partition by session_id) as session_start, count(*) over (partition by session_id) as interaction_count, first_value(utm_source) over (partition by session_id order by timestamp) as first_touchpoint, first_value(utm_source) over (partition by session_id order by timestamp desc) as last_touchpoint, first_value(utm_medium) over (partition by session_id order by timestamp desc) as last_medium from interaction i where timestamp between ? - interval 2 day and ? + interval 2 day ) s where timestamp = session_start and session_start between ? - interval 1 day and ? + interval 1 day;
Your use of first_value()
should be returning an error — it violates the rules of “full group by” which MySQL 8+ has set by default. No surprise that syntactically incorrect code is not working.