How to split dates into groups of 24 hours -BigQuery

this is my goal, split the dates into groups of 24 hours, but those depends of the data, not of a very specific date

Let’s say these are the calls from a call center, I want to know how many sessions I had, but those sessions are valid for 24 hours, those 24 hours are starting to count since the first date_sent, if the next call is later of those first 24 hours, a new session would be created

The expected results are like this: First 3 columns are those I already have in the table, fourth one is the required calculation

identifier  customer_id date_sent               StartOfSession          sessionId
456456150   5366    2020-09-01T10:17:48.360000  2020-09-01T10:17:48.360000  1
456456150   5366    2020-09-01T18:24:45.552000  2020-09-01T10:17:48.360000  1
456456150   5366    2020-09-02T10:20:46.283000  2020-09-02T10:20:46.283000  2
456456150   5366    2020-09-02T18:25:01.911000  2020-09-02T10:20:46.283000  2
456456150   5366    2020-09-03T10:20:38.407000  2020-09-02T10:20:46.283000  2
456456150   5366    2020-09-03T18:23:35.915000  2020-09-03T18:23:35.915000  3
456456150   5366    2020-09-04T10:19:46.474000  2020-09-03T18:23:35.915000  3
456456150   5366    2020-09-04T14:22:17.236000  2020-09-03T18:23:35.915000  3
456456150   5366    2020-09-04T18:24:33.155000  2020-09-04T18:24:33.155000  4
456456150   5366    2020-09-05T10:19:48.871000  2020-09-04T18:24:33.155000  4
456456150   5366    2020-09-05T18:25:07.968000  2020-09-05T18:25:07.968000  5
456456150   5366    2020-09-06T10:19:34.808000  2020-09-05T18:25:07.968000  5
456456150   5366    2020-09-06T18:26:17.705000  2020-09-06T18:26:17.705000  6
456456150   5366    2020-09-07T10:21:28.585000  2020-09-06T18:26:17.705000  6
456456150   5366    2020-09-07T18:24:17.123000  2020-09-06T18:26:17.705000  6
456456150   5366    2020-09-08T10:20:09.850000  2020-09-08T10:20:09.850000  7
456456150   5366    2020-09-08T18:24:32.733000  2020-09-08T10:20:09.850000  7
456456150   5366    2020-09-09T10:20:05.336000  2020-09-08T10:20:09.850000  7
456456150   5366    2020-09-09T12:12:41.137000  2020-09-09T12:12:41.137000  8
456456150   5366    2020-09-09T18:24:25.783000  2020-09-09T12:12:41.137000  8

​x
 
identifier  customer_id date_sent               StartOfSession          sessionId456456150   5366    2020-09-01T10:17:48.360000  2020-09-01T10:17:48.360000  1456456150   5366    2020-09-01T18:24:45.552000  2020-09-01T10:17:48.360000  1456456150   5366    2020-09-02T10:20:46.283000  2020-09-02T10:20:46.283000  2456456150   5366    2020-09-02T18:25:01.911000  2020-09-02T10:20:46.283000  2456456150   5366    2020-09-03T10:20:38.407000  2020-09-02T10:20:46.283000  2456456150   5366    2020-09-03T18:23:35.915000  2020-09-03T18:23:35.915000  3456456150   5366    2020-09-04T10:19:46.474000  2020-09-03T18:23:35.915000  3456456150   5366    2020-09-04T14:22:17.236000  2020-09-03T18:23:35.915000  3456456150   5366    2020-09-04T18:24:33.155000  2020-09-04T18:24:33.155000  4456456150   5366    2020-09-05T10:19:48.871000  2020-09-04T18:24:33.155000  4456456150   5366    2020-09-05T18:25:07.968000  2020-09-05T18:25:07.968000  5456456150   5366    2020-09-06T10:19:34.808000  2020-09-05T18:25:07.968000  5456456150   5366    2020-09-06T18:26:17.705000  2020-09-06T18:26:17.705000  6456456150   5366    2020-09-07T10:21:28.585000  2020-09-06T18:26:17.705000  6456456150   5366    2020-09-07T18:24:17.123000  2020-09-06T18:26:17.705000  6456456150   5366    2020-09-08T10:20:09.850000  2020-09-08T10:20:09.850000  7456456150   5366    2020-09-08T18:24:32.733000  2020-09-08T10:20:09.850000  7456456150   5366    2020-09-09T10:20:05.336000  2020-09-08T10:20:09.850000  7456456150   5366    2020-09-09T12:12:41.137000  2020-09-09T12:12:41.137000  8456456150   5366    2020-09-09T18:24:25.783000  2020-09-09T12:12:41.137000  8​

I’ve tried using window functions, but I cannot achieve the same expected results:

  SELECT identifier, customer_id, date_sent,
    FIRST_VALUE(date_sent) OVER (PARTITION BY A.identifier, A.customer_id, CAST(A.date_sent AS DATE) ORDER BY UNIX_SECONDS(TIMESTAMP(date_sent)) RANGE BETWEEN 86400 PRECEDING AND CURRENT ROW) FirstV_date1
  FROM `sandbox.testing` A
  WHERE identifier = '456456150'
AND date_sent between '2020-09-01' AND '2020-09-10'

 
  SELECT identifier, customer_id, date_sent,    FIRST_VALUE(date_sent) OVER (PARTITION BY A.identifier, A.customer_id, CAST(A.date_sent AS DATE) ORDER BY UNIX_SECONDS(TIMESTAMP(date_sent)) RANGE BETWEEN 86400 PRECEDING AND CURRENT ROW) FirstV_date1  FROM `sandbox.testing` A  WHERE identifier = '456456150'AND date_sent between '2020-09-01' AND '2020-09-10'​

Those would be my actual results

identifier  customer_id date                    FirstV_date1
456456150   5366    2020-09-01T10:17:48.360000  2020-09-01T10:17:48.360000
456456150   5366    2020-09-01T18:24:45.552000  2020-09-01T10:17:48.360000
456456150   5366    2020-09-02T10:20:46.283000  2020-09-02T10:20:46.283000
456456150   5366    2020-09-02T18:25:01.911000  2020-09-02T10:20:46.283000
456456150   5366    2020-09-03T10:20:38.407000  2020-09-03T10:20:38.407000
456456150   5366    2020-09-03T18:23:35.915000  2020-09-03T10:20:38.407000
456456150   5366    2020-09-04T10:19:46.474000  2020-09-04T10:19:46.474000
456456150   5366    2020-09-04T14:22:17.236000  2020-09-04T10:19:46.474000
456456150   5366    2020-09-04T18:24:33.155000  2020-09-04T10:19:46.474000
456456150   5366    2020-09-05T10:19:48.871000  2020-09-05T10:19:48.871000
456456150   5366    2020-09-05T18:25:07.968000  2020-09-05T10:19:48.871000
456456150   5366    2020-09-06T10:19:34.808000  2020-09-06T10:19:34.808000
456456150   5366    2020-09-06T18:26:17.705000  2020-09-06T10:19:34.808000
456456150   5366    2020-09-07T10:21:28.585000  2020-09-07T10:21:28.585000
456456150   5366    2020-09-07T18:24:17.123000  2020-09-07T10:21:28.585000
456456150   5366    2020-09-08T10:20:09.850000  2020-09-08T10:20:09.850000
456456150   5366    2020-09-08T18:24:32.733000  2020-09-08T10:20:09.850000
456456150   5366    2020-09-09T10:20:05.336000  2020-09-09T10:20:05.336000
456456150   5366    2020-09-09T12:12:41.137000  2020-09-09T10:20:05.336000
456456150   5366    2020-09-09T18:24:25.783000  2020-09-09T10:20:05.336000

 
identifier  customer_id date                    FirstV_date1456456150   5366    2020-09-01T10:17:48.360000  2020-09-01T10:17:48.360000456456150   5366    2020-09-01T18:24:45.552000  2020-09-01T10:17:48.360000456456150   5366    2020-09-02T10:20:46.283000  2020-09-02T10:20:46.283000456456150   5366    2020-09-02T18:25:01.911000  2020-09-02T10:20:46.283000456456150   5366    2020-09-03T10:20:38.407000  2020-09-03T10:20:38.407000456456150   5366    2020-09-03T18:23:35.915000  2020-09-03T10:20:38.407000456456150   5366    2020-09-04T10:19:46.474000  2020-09-04T10:19:46.474000456456150   5366    2020-09-04T14:22:17.236000  2020-09-04T10:19:46.474000456456150   5366    2020-09-04T18:24:33.155000  2020-09-04T10:19:46.474000456456150   5366    2020-09-05T10:19:48.871000  2020-09-05T10:19:48.871000456456150   5366    2020-09-05T18:25:07.968000  2020-09-05T10:19:48.871000456456150   5366    2020-09-06T10:19:34.808000  2020-09-06T10:19:34.808000456456150   5366    2020-09-06T18:26:17.705000  2020-09-06T10:19:34.808000456456150   5366    2020-09-07T10:21:28.585000  2020-09-07T10:21:28.585000456456150   5366    2020-09-07T18:24:17.123000  2020-09-07T10:21:28.585000456456150   5366    2020-09-08T10:20:09.850000  2020-09-08T10:20:09.850000456456150   5366    2020-09-08T18:24:32.733000  2020-09-08T10:20:09.850000456456150   5366    2020-09-09T10:20:05.336000  2020-09-09T10:20:05.336000456456150   5366    2020-09-09T12:12:41.137000  2020-09-09T10:20:05.336000456456150   5366    2020-09-09T18:24:25.783000  2020-09-09T10:20:05.336000​

I’ve also tried using self join, but I rather not because that’s very expensive, but, any ideas would be welcome.

Thanks in advance!

Answer

Below is for BigQuery Standard SQL (assumes that date_sent column is of timestamp data type – as it looks like in provided example)

#standardSQL
create temp function get_sessions(x array<timestamp>)
returns array<struct<date_sent timestamp, sessionStart timestamp, session string>>
language js as """
  output = []; session = 1; sessionStart = x[0]; total_dur = 0;  
  a = {}; a.date_sent = x[0]; a.session = session; a.sessionStart = sessionStart;
  output.push(a);
  for(i = 1; i < x.length; i++){
    a = {};
    total_dur += x[i].getTime() - x[i-1].getTime();
    if(total_dur>24*3600*1000){
      total_dur = 0; session++; sessionStart = x[i];  
    };
    a.date_sent = x[i-1]; a.sessionStart = sessionStart; a.session = session;
    output.push(a);
  }
  return output;
""";
select identifier, customer_id, date_sent, sessionStart, session
from (
  select identifier, customer_id, get_sessions(array_agg(date_sent order by date_sent)) sessions
  from `project.dataset.table`
  group by identifier, customer_id
), unnest(sessions)

 
#standardSQLcreate temp function get_sessions(x array<timestamp>)returns array<struct<date_sent timestamp, sessionStart timestamp, session string>>language js as """  output = []; session = 1; sessionStart = x[0]; total_dur = 0;    a = {}; a.date_sent = x[0]; a.session = session; a.sessionStart = sessionStart;  output.push(a);  for(i = 1; i < x.length; i++){    a = {};    total_dur += x[i].getTime() - x[i-1].getTime();    if(total_dur>24*3600*1000){      total_dur = 0; session++; sessionStart = x[i];      };    a.date_sent = x[i-1]; a.sessionStart = sessionStart; a.session = session;    output.push(a);  }  return output;""";select identifier, customer_id, date_sent, sessionStart, sessionfrom (  select identifier, customer_id, get_sessions(array_agg(date_sent order by date_sent)) sessions  from `project.dataset.table`  group by identifier, customer_id), unnest(sessions)​

if to apply to sample data in the question – output is

Important: assumption here – based on your comments – volume of rows per partition (identifier, customer_id) is relatively small (~2K) so js udf memory limit is not a problem here)

Advertisement

Answer