I’m working on a dataset with anonymized location data and I receive a series of time-stamp intervals at various points of interest. I’m trying to calculate the dwell time for an individual within a point of interest. I’ve tried a simple method of max(timestamp) – min(timestamp) to calculate a time difference. This works well for many of the devices in the dataset, but not if the user returns to a place multiple times. I’m missing some step to group the data whenever there is a change in location and calculate the interval for that group of timestamps.
Here is a subset of the data:
ID, Location, UnixTimestamp
IDABCDE,"Place_3",2018-03-16 07:42:39
IDABCDE,"Place_2",2018-03-16 12:50:22
IDABCDE,"Place_2",2018-03-16 12:50:23
IDABCDE,"Place_2",2018-03-16 12:50:28
IDABCDE,"Place_1",2018-03-16 12:58:27
IDABCDE,"Place_2",2018-03-16 12:58:27
IDABCDE,"Place_2",2018-03-16 13:02:45
IDABCDE,"Place_1",2018-03-16 13:02:45
IDABCDE,"Place_2",2018-03-16 14:05:47
IDABCDE,"Place_2",2018-03-16 14:05:54
IDABCDE,"Place_2",2018-03-16 14:05:57
IDABCDE,"Place_2",2018-03-16 14:10:19
IDABCDE,"Place_2",2018-03-16 14:11:13
IDABCDE,"Place_1",2018-03-16 14:11:13
IDABCDE,"Place_3",2018-03-16 14:13:03
IDABCDE,"Place_3",2018-03-16 14:16:01
IDABCDE,"Place_3",2018-03-16 14:16:01
IDABCDE,"Place_3",2018-03-16 14:16:08
IDABCDE,"Place_3",2018-03-16 14:16:14
IDABCDE,"Place_3",2018-03-16 14:17:08
IDABCDE,"Place_3",2018-03-16 14:17:31
IDABCDE,"Place_3",2018-03-16 14:18:07
IDABCDE,"Place_3",2018-03-16 14:18:07
IDABCDE,"Place_3",2018-03-16 14:18:07
IDABCDE,"Place_3",2018-03-16 15:02:04
IDABCDE,"Place_3",2018-03-16 15:02:04
IDABCDE,"Place_3",2018-03-16 15:02:19
IDABCDE,"Place_3",2018-03-16 15:03:17
IDABCDE,"Place_3",2018-03-16 15:03:17
IDABCDE,"Place_2",2018-03-16 15:05:34
The query should return a duration for the grouping of values for the first instance of Place 2, 0 for Place 1 (only one observation), calculate a second dwell time for place_2, 0 for a second visit to place_1, then a dwell time for Place_3.
I’ve tried variations on the following:
SELECT ID, location,
min(Unix_Timestamp) as first_observation, max(Unix_Timestamp) as last_observation,
TIMESTAMP_DIFF(max(Unix_Timestamp),min(Unix_Timestamp), minute) as time_in_location
FROM `table_name`
GROUP BY ID, location
This returns the time difference between the first and last values of each place (which is what the query is asking for, but isn’t quite what I need).
I believe I need a window function, but trying different Partition_By values still isn’t returning the correct result.
Any insight you have is greatly appreciated. Thank you.
Advertisement
Answer
This is the segmentation code you’re looking for:
SELECT *, SUM(x) OVER(PARTITION BY id ORDER BY unix_timestamp, location) segmentId
FROM (
SELECT *, IF(LAG(location) OVER(PARTITION BY id ORDER BY unix_timestamp, location)=location,0,1) x
FROM data
)
With this your existing code will run as expected with a minimal modification: GROUP BY ..., segment_id
.
WITH data AS (
SELECT REGEXP_EXTRACT(x, r'([^,]*),') id
, REGEXP_EXTRACT(x, r',([^,]*),') location
, TIMESTAMP(REGEXP_EXTRACT(x, r',([^,]*)$')) unix_timestamp
FROM UNNEST(SPLIT("""IDABCDE,"Place_3",2018-03-16 07:42:39
IDABCDE,"Place_2",2018-03-16 12:50:22
IDABCDE,"Place_2",2018-03-16 12:50:23
IDABCDE,"Place_2",2018-03-16 12:50:28
IDABCDE,"Place_1",2018-03-16 12:58:27
IDABCDE,"Place_2",2018-03-16 12:58:27
IDABCDE,"Place_2",2018-03-16 13:02:45
IDABCDE,"Place_1",2018-03-16 13:02:45
IDABCDE,"Place_2",2018-03-16 14:05:47
IDABCDE,"Place_2",2018-03-16 14:05:54
IDABCDE,"Place_2",2018-03-16 14:05:57
IDABCDE,"Place_2",2018-03-16 14:10:19
IDABCDE,"Place_2",2018-03-16 14:11:13
IDABCDE,"Place_1",2018-03-16 14:11:13
IDABCDE,"Place_3",2018-03-16 14:13:03
IDABCDE,"Place_3",2018-03-16 14:16:01
IDABCDE,"Place_3",2018-03-16 14:16:01
IDABCDE,"Place_3",2018-03-16 14:16:08
IDABCDE,"Place_3",2018-03-16 14:16:14
IDABCDE,"Place_3",2018-03-16 14:17:08
IDABCDE,"Place_3",2018-03-16 14:17:31
IDABCDE,"Place_3",2018-03-16 14:18:07
IDABCDE,"Place_3",2018-03-16 14:18:07
IDABCDE,"Place_3",2018-03-16 14:18:07
IDABCDE,"Place_3",2018-03-16 15:02:04
IDABCDE,"Place_3",2018-03-16 15:02:04
IDABCDE,"Place_3",2018-03-16 15:02:19
IDABCDE,"Place_3",2018-03-16 15:03:17
IDABCDE,"Place_3",2018-03-16 15:03:17
IDABCDE,"Place_2",2018-03-16 15:05:34""", 'n')) x
)
, dataprep AS (
SELECT *, SUM(x) OVER(ORDER BY unix_timestamp, location) segmentId
FROM (
SELECT *, IF(LAG(location) OVER(ORDER BY unix_timestamp, location)=location,0,1) x
FROM data
)
)
SELECT
ID, location,
min(Unix_Timestamp) as first_observation, max(Unix_Timestamp) as last_observation,
TIMESTAMP_DIFF(max(Unix_Timestamp),min(Unix_Timestamp), second) as time_in_location
, COUNT(*) steps
FROM dataprep
GROUP BY ID, location, segmentId