I have a situation where I’m trying to get data to populate a Sankey graph.
I have the data with timestamp and the location of the person when captured by the system.
The normal case is when a person location changes and in that case, “from” should be that location and “to” should be that person’s next entry provided it’s less than 2h difference from “from”.
Original data:
date, macAdress, zone 8h10m, 00-B0-D0-63-C2-26, room1 8h12m, 00-B0-D0-63-C2-26, hall 8h15m, 00-A0-B0-23-T2-22, room1 8h16m, 00-A0-B0-23-T2-22, meeting2 8h18m, 00-B0-D0-63-C2-26, meeting2 8h25m, 00-A0-B0-23-T2-22, cafetaria 8h30m, 00-G5-A8-44-T2-30, room1 8h34m, 00-G5-A8-44-T2-30, meeting2 14h05m, 00-G5-A8-44-T2-30, cafetaria
Result required in the following way (or similar):
from, to, count room1, hall, 1 hall, meeting2, 2 room1, meeting2, 1 meeting2, cafetaria, 1 <-- special case as time from last zone is longer than 2h it didn't count "14h05m, 00-G5-A8-44-T2-30, cafetaria"
Can you give me an idea of how can I get such result in SQL? I’m using BigQuery, but I believe standard SQL should do the job.
Thanks,
Diogo
Advertisement
Answer
Consider below
with temp as ( select *, parse_time('%Hh%Mm', date) time from your_table ), from_to as ( select zone as _from, lead(zone) over win as _to, time_diff(lead(time) over win, time, minute) as duration from temp window win as (partition by macAdress order by time) ) select _from, _to, count(*) _count from from_to where not _to is null and duration < 120 group by _from, _to
if applied to sample data as in your question
output is