SQL aggregation based on time and location for Sankey graph

I have a situation where I’m trying to get data to populate a Sankey graph.

I have the data with timestamp and the location of the person when captured by the system.

The normal case is when a person location changes and in that case, “from” should be that location and “to” should be that person’s next entry provided it’s less than 2h difference from “from”.

Original data:

date,   macAdress,         zone
8h10m,  00-B0-D0-63-C2-26, room1
8h12m,  00-B0-D0-63-C2-26, hall
8h15m,  00-A0-B0-23-T2-22, room1
8h16m,  00-A0-B0-23-T2-22, meeting2
8h18m,  00-B0-D0-63-C2-26, meeting2
8h25m,  00-A0-B0-23-T2-22, cafetaria
8h30m,  00-G5-A8-44-T2-30, room1
8h34m,  00-G5-A8-44-T2-30, meeting2
14h05m, 00-G5-A8-44-T2-30, cafetaria

​x
 
date,   macAdress,         zone8h10m,  00-B0-D0-63-C2-26, room18h12m,  00-B0-D0-63-C2-26, hall8h15m,  00-A0-B0-23-T2-22, room18h16m,  00-A0-B0-23-T2-22, meeting28h18m,  00-B0-D0-63-C2-26, meeting28h25m,  00-A0-B0-23-T2-22, cafetaria8h30m,  00-G5-A8-44-T2-30, room18h34m,  00-G5-A8-44-T2-30, meeting214h05m, 00-G5-A8-44-T2-30, cafetaria​

Result required in the following way (or similar):

from,     to,        count
room1,    hall,      1
hall,     meeting2,  2
room1,    meeting2,  1
meeting2, cafetaria, 1 <-- special case as time from last zone is longer than 2h it didn't count "14h05m, 00-G5-A8-44-T2-30, cafetaria"

 
from,     to,        countroom1,    hall,      1hall,     meeting2,  2room1,    meeting2,  1meeting2, cafetaria, 1 <-- special case as time from last zone is longer than 2h it didn't count "14h05m, 00-G5-A8-44-T2-30, cafetaria"​

Can you give me an idea of how can I get such result in SQL? I’m using BigQuery, but I believe standard SQL should do the job.

Thanks,

Diogo

Answer

Consider below

with temp as (
  select *, parse_time('%Hh%Mm', date) time
  from your_table
), from_to as (
  select 
    zone as _from, lead(zone) over win as _to,
    time_diff(lead(time) over win, time, minute) as duration
  from temp
  window win as (partition by macAdress order by time)
)
select _from, _to, count(*) _count
from from_to
where not _to is null
and duration < 120
group by _from, _to

 
with temp as (  select *, parse_time('%Hh%Mm', date) time  from your_table), from_to as (  select     zone as _from, lead(zone) over win as _to,    time_diff(lead(time) over win, time, minute) as duration  from temp  window win as (partition by macAdress order by time))select _from, _to, count(*) _countfrom from_towhere not _to is nulland duration < 120group by _from, _to​

if applied to sample data as in your question

output is

Advertisement

Answer