Given a table with entries at irregular time stamps, “breaks” must be inserted at regular 5 min intervals ( the data associated can / will be NULL ).
I was thinking of getting the start time, making a subquery that has a window function and adds 5 min intervals to the start time – but I only could think of using row_number to increment the values.
WITH data as( select id, data, cast(date_and_time as double) * 1000 as time_milliseconds from t1), -- original data start_times as( select id, MIN(CAST(date_and_time as double) * 1000) as start_time from t1 GROUP BY id ), -- first timestamp for each id boundries as ( SELECT T1.id,(row_number() OVER (PARTITION BY T1.id ORDER BY T1.date_and_time)-1) *300000 + start_times.start_time as boundry from T1 INNER JOIN start_times ON start_times.id= T1.id ) -- increment the number of 5 min added on each row and later full join boundries table with original data
However this limits me to the number of rows present for an id in the original data table, and if the timestamps are spread out, the number of rows cannot cover the amount of 5 min intervals needed to be added.
sample data:
initial data: |-----------|------------------|------------------| | id | value | timestamp | |-----------|------------------|------------------| | 1 | 3 | 12:00:01.011 | |-----------|------------------|------------------| | 1 | 4 | 12:03:30.041 | |-----------|------------------|------------------| | 1 | 5 | 12:12:20.231 | |-----------|------------------|------------------| | 1 | 3 | 15:00:00.312 | data after my query: |-----------|------------------|------------------| | id | value | timestamp (UNIX) | |-----------|------------------|------------------| | 1 | 3 | 12:00:01 | |-----------|------------------|------------------| | 1 | 4 | 12:03:30 | |-----------|------------------|------------------| | 1 | NULL | 12:05:01 | <-- Data from "boundries" |-----------|------------------|------------------| | 1 | NULL | 12:10:01 | <-- Data from "boundries" |-----------|------------------|------------------| | 1 | 5 | 12:12:20 | |-----------|------------------|------------------| | 1 | NULL | 12:15:01 | <-- Data from "boundries" |-----------|------------------|------------------| | 1 | NULL | 12:20:01 | <-- Data from "boundries" |-----------|------------------|------------------| <-- Jumping directly to 15:00:00 (WRONG! :( need to insert more 5 min breaks here ) | 1 | 3 | 15:00:00 |
I was thinking of creating a temporary table inside HIVE and filling it with x rows representing 5 min intervals from the starttime to the endtime of the data table, but I couldn’t find any way of accomplishing that.
Any way of using “for loops” ? Any suggestions would be appreciated.
Thanks
Advertisement
Answer
You can try calculating the difference between current timestamp and next one, divide 300 to get number of ranges, produce a string of spaces with length = num_ranges, explode to generate rows.
Demo:
with your_table as (--initial data example select stack (3, 1,3 ,'2020-01-01 12:00:01.011', 1,4 ,'2020-01-01 12:03:30.041', 1,5 ,'2020-01-01 12:20:20.231' ) as (id ,value ,ts ) ) select id ,value, ts, next_ts, diff_sec,num_intervals, from_unixtime(unix_timestamp(ts)+h.i*300) new_ts, coalesce(from_unixtime(unix_timestamp(ts)+h.i*300),ts) as calculated_timestamp from ( select id ,value ,ts, next_ts, (unix_timestamp(next_ts)-unix_timestamp(ts)) diff_sec, floor((unix_timestamp(next_ts)-unix_timestamp(ts))/300 --diff in seconds/5 min ) num_intervals from ( select id ,value ,ts, lead(ts) over(order by ts) next_ts from your_table ) s )s lateral view outer posexplode(split(space(cast(s.num_intervals as int)),' ')) h as i,x --this will generate rows
Result:
id value ts next_ts diff_sec num_intervals new_ts calculated_timestamp 1 3 2020-01-01 12:00:01.011 2020-01-01 12:03:30.041 209 0 2020-01-01 12:00:01 2020-01-01 12:00:01 1 4 2020-01-01 12:03:30.041 2020-01-01 12:20:20.231 1010 3 2020-01-01 12:03:30 2020-01-01 12:03:30 1 4 2020-01-01 12:03:30.041 2020-01-01 12:20:20.231 1010 3 2020-01-01 12:08:30 2020-01-01 12:08:30 1 4 2020-01-01 12:03:30.041 2020-01-01 12:20:20.231 1010 3 2020-01-01 12:13:30 2020-01-01 12:13:30 1 4 2020-01-01 12:03:30.041 2020-01-01 12:20:20.231 1010 3 2020-01-01 12:18:30 2020-01-01 12:18:30 1 5 2020-01-01 12:20:20.231 N N N N 2020-01-01 12:20:20.231
Additional rows were added. I left all intermediate columns for debugging purposes.