this is my goal, split the dates into groups of 24 hours, but those depends of the data, not of a very specific date
Let’s say these are the calls from a call center, I want to know how many sessions I had, but those sessions are valid for 24 hours, those 24 hours are starting to count since the first date_sent, if the next call is later of those first 24 hours, a new session would be created
The expected results are like this: First 3 columns are those I already have in the table, fourth one is the required calculation
identifier customer_id date_sent StartOfSession sessionId 456456150 5366 2020-09-01T10:17:48.360000 2020-09-01T10:17:48.360000 1 456456150 5366 2020-09-01T18:24:45.552000 2020-09-01T10:17:48.360000 1 456456150 5366 2020-09-02T10:20:46.283000 2020-09-02T10:20:46.283000 2 456456150 5366 2020-09-02T18:25:01.911000 2020-09-02T10:20:46.283000 2 456456150 5366 2020-09-03T10:20:38.407000 2020-09-02T10:20:46.283000 2 456456150 5366 2020-09-03T18:23:35.915000 2020-09-03T18:23:35.915000 3 456456150 5366 2020-09-04T10:19:46.474000 2020-09-03T18:23:35.915000 3 456456150 5366 2020-09-04T14:22:17.236000 2020-09-03T18:23:35.915000 3 456456150 5366 2020-09-04T18:24:33.155000 2020-09-04T18:24:33.155000 4 456456150 5366 2020-09-05T10:19:48.871000 2020-09-04T18:24:33.155000 4 456456150 5366 2020-09-05T18:25:07.968000 2020-09-05T18:25:07.968000 5 456456150 5366 2020-09-06T10:19:34.808000 2020-09-05T18:25:07.968000 5 456456150 5366 2020-09-06T18:26:17.705000 2020-09-06T18:26:17.705000 6 456456150 5366 2020-09-07T10:21:28.585000 2020-09-06T18:26:17.705000 6 456456150 5366 2020-09-07T18:24:17.123000 2020-09-06T18:26:17.705000 6 456456150 5366 2020-09-08T10:20:09.850000 2020-09-08T10:20:09.850000 7 456456150 5366 2020-09-08T18:24:32.733000 2020-09-08T10:20:09.850000 7 456456150 5366 2020-09-09T10:20:05.336000 2020-09-08T10:20:09.850000 7 456456150 5366 2020-09-09T12:12:41.137000 2020-09-09T12:12:41.137000 8 456456150 5366 2020-09-09T18:24:25.783000 2020-09-09T12:12:41.137000 8
I’ve tried using window functions, but I cannot achieve the same expected results:
SELECT identifier, customer_id, date_sent, FIRST_VALUE(date_sent) OVER (PARTITION BY A.identifier, A.customer_id, CAST(A.date_sent AS DATE) ORDER BY UNIX_SECONDS(TIMESTAMP(date_sent)) RANGE BETWEEN 86400 PRECEDING AND CURRENT ROW) FirstV_date1 FROM `sandbox.testing` A WHERE identifier = '456456150' AND date_sent between '2020-09-01' AND '2020-09-10'
Those would be my actual results
identifier customer_id date FirstV_date1 456456150 5366 2020-09-01T10:17:48.360000 2020-09-01T10:17:48.360000 456456150 5366 2020-09-01T18:24:45.552000 2020-09-01T10:17:48.360000 456456150 5366 2020-09-02T10:20:46.283000 2020-09-02T10:20:46.283000 456456150 5366 2020-09-02T18:25:01.911000 2020-09-02T10:20:46.283000 456456150 5366 2020-09-03T10:20:38.407000 2020-09-03T10:20:38.407000 456456150 5366 2020-09-03T18:23:35.915000 2020-09-03T10:20:38.407000 456456150 5366 2020-09-04T10:19:46.474000 2020-09-04T10:19:46.474000 456456150 5366 2020-09-04T14:22:17.236000 2020-09-04T10:19:46.474000 456456150 5366 2020-09-04T18:24:33.155000 2020-09-04T10:19:46.474000 456456150 5366 2020-09-05T10:19:48.871000 2020-09-05T10:19:48.871000 456456150 5366 2020-09-05T18:25:07.968000 2020-09-05T10:19:48.871000 456456150 5366 2020-09-06T10:19:34.808000 2020-09-06T10:19:34.808000 456456150 5366 2020-09-06T18:26:17.705000 2020-09-06T10:19:34.808000 456456150 5366 2020-09-07T10:21:28.585000 2020-09-07T10:21:28.585000 456456150 5366 2020-09-07T18:24:17.123000 2020-09-07T10:21:28.585000 456456150 5366 2020-09-08T10:20:09.850000 2020-09-08T10:20:09.850000 456456150 5366 2020-09-08T18:24:32.733000 2020-09-08T10:20:09.850000 456456150 5366 2020-09-09T10:20:05.336000 2020-09-09T10:20:05.336000 456456150 5366 2020-09-09T12:12:41.137000 2020-09-09T10:20:05.336000 456456150 5366 2020-09-09T18:24:25.783000 2020-09-09T10:20:05.336000
I’ve also tried using self join, but I rather not because that’s very expensive, but, any ideas would be welcome.
Thanks in advance!
Advertisement
Answer
Below is for BigQuery Standard SQL (assumes that date_sent column is of timestamp data type – as it looks like in provided example)
#standardSQL create temp function get_sessions(x array<timestamp>) returns array<struct<date_sent timestamp, sessionStart timestamp, session string>> language js as """ output = []; session = 1; sessionStart = x[0]; total_dur = 0; a = {}; a.date_sent = x[0]; a.session = session; a.sessionStart = sessionStart; output.push(a); for(i = 1; i < x.length; i++){ a = {}; total_dur += x[i].getTime() - x[i-1].getTime(); if(total_dur>24*3600*1000){ total_dur = 0; session++; sessionStart = x[i]; }; a.date_sent = x[i-1]; a.sessionStart = sessionStart; a.session = session; output.push(a); } return output; """; select identifier, customer_id, date_sent, sessionStart, session from ( select identifier, customer_id, get_sessions(array_agg(date_sent order by date_sent)) sessions from `project.dataset.table` group by identifier, customer_id ), unnest(sessions)
if to apply to sample data in the question – output is
Important: assumption here – based on your comments – volume of rows per partition (identifier, customer_id) is relatively small (~2K) so js udf memory limit is not a problem here)