Skip to content
Advertisement

SQL aggregation based on time and location for Sankey graph

I have a situation where I’m trying to get data to populate a Sankey graph.

I have the data with timestamp and the location of the person when captured by the system.

The normal case is when a person location changes and in that case, “from” should be that location and “to” should be that person’s next entry provided it’s less than 2h difference from “from”.

Original data:

date,   macAdress,         zone
8h10m,  00-B0-D0-63-C2-26, room1
8h12m,  00-B0-D0-63-C2-26, hall
8h15m,  00-A0-B0-23-T2-22, room1
8h16m,  00-A0-B0-23-T2-22, meeting2
8h18m,  00-B0-D0-63-C2-26, meeting2
8h25m,  00-A0-B0-23-T2-22, cafetaria
8h30m,  00-G5-A8-44-T2-30, room1
8h34m,  00-G5-A8-44-T2-30, meeting2
14h05m, 00-G5-A8-44-T2-30, cafetaria

Result required in the following way (or similar):

from,     to,        count
room1,    hall,      1
hall,     meeting2,  2
room1,    meeting2,  1
meeting2, cafetaria, 1 <-- special case as time from last zone is longer than 2h it didn't count "14h05m, 00-G5-A8-44-T2-30, cafetaria"

Can you give me an idea of how can I get such result in SQL? I’m using BigQuery, but I believe standard SQL should do the job.

Thanks,

Diogo

Advertisement

Answer

Consider below

with temp as (
  select *, parse_time('%Hh%Mm', date) time
  from your_table
), from_to as (
  select 
    zone as _from, lead(zone) over win as _to,
    time_diff(lead(time) over win, time, minute) as duration
  from temp
  window win as (partition by macAdress order by time)
)
select _from, _to, count(*) _count
from from_to
where not _to is null
and duration < 120
group by _from, _to

if applied to sample data as in your question

enter image description here

output is

enter image description here

User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement