I have data that looks like (over 100.000 rows):
x
timestamp Location person
2017-09-04 08:07:00 UTC A x
2017-09-04 08:08:00 UTC B y
2017-09-04 08:09:00 UTC A y
2017-09-04 08:07:00 UTC A x
2017-09-04 08:27:00 UTC B x
What I would like:
Location Nr_of_persons_working_at_the_same_time
A 2
B 1
Explanation
timestamp Location person
2017-09-04 08:07:00 UTC A x <--- first action in A by person x
2017-09-04 08:08:00 UTC B y <--- different first action in B by person y
2017-09-04 08:09:00 UTC A y <--- second action in A, but could be different action as person x might be gone
2017-09-04 08:07:00 UTC A x <--- person x is still there, so count of persons in A is 2
2017-09-04 08:27:00 UTC B x <--- not a different action, person x coming in after 20 minutes, count of persons working at the same time remains 1
CONTEXT
I want to find out how many people (person) are working in the same location (Location) by looking at a time window (timestamp) of max 10 minutes and checking whether a person is really working simultaneously or just taking over their shift within that frame. I get the data with a SQL query and could parse it using either SQL or Python. SQL is preferred.
TRIED SOLUTIONS
- Grouping by location, timestamp leads to ‘hard cuts’
- Probably need an so-called window function. But after sorting on timestamp, how do I prevent Locations mixing up?
Note: If easier, I can also try to do this in Python, but I rather not given the size of the dataset and my limited options to do it in the cloud.
Advertisement
Answer
This should work
with mytable as (
select cast('2017-09-04 08:07:00' as datetime) as _timestamp ,'A' as Location,'x' as person union all
select cast('2017-09-04 08:08:00' as datetime) as _timestamp ,'B' as Location,'y' as person union all
select cast('2017-09-04 08:09:00' as datetime) as _timestamp ,'A' as Location,'y' as person union all
select cast('2017-09-04 08:07:00' as datetime) as _timestamp ,'A' as Location,'x' as person union all
select cast('2017-09-04 08:27:00' as datetime) as _timestamp ,'B' as Location,'x' as person
),
sorted_entry
as (
select *,
ifnull(first_value(_timestamp) over(partition by Location order by _timestamp ),_timestamp ) as prev_timestamp ,
ifnull(lag(person) over(partition by Location order by _timestamp ),person ) as another_person
from mytable
)
,flagged
as
(
select *,
case when person <> another_person then (
case when datetime_diff(_timestamp,prev_timestamp,minute) <= 10 then 1
else 0 end
)
else 0
end as flag
from sorted_entry
)
select location ,sum(flag) + 1 as _count
from flagged
group by location