I have a Table in Impala in which I have time information as Unix-Time (with a frequency of 1 mSec) and information about three variables, like given below:
ts Val1 Val2 Val3 1.60669E+12 7541.76 0.55964607 267.1613 1.60669E+12 7543.04 0.5607262 267.27805 1.60669E+12 7543.04 0.5607241 267.22308 1.60669E+12 7543.6797 0.56109643 267.25974 1.60669E+12 7543.6797 0.56107396 267.30624 1.60669E+12 7543.6797 0.56170875 267.2643
I want to resample the data and to get the last value of the new time window. For example, if I want to resample as 10Sec frequency the output should be the last value of 10Sec window, like given below:
ts val1_Last Val2_Last Val3_Last 2020-11-29 22:30:00 7541.76 0.55964607 267.1613 2020-11-29 22:30:10 7542.3994 0.5613486 267.31238 2020-11-29 22:30:20 7542.3994 0.5601791 267.22842 2020-11-29 22:30:30 7544.32 0.56069416 267.20248
To have this result, I am running the following query:
select distinct * from ( select ts, last_value(Val1) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val1, last_value(Val2) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val2, last_value(Val3) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val3 from (SELECT cast(cast(unix_timestamp(cast(ts/1000 as TIMESTAMP))/10 as bigint)*10 as TIMESTAMP) as ts , Val1 as Val1, Val2 as Val2, Val3 as Val3 FROM Sensor_Data.Table where unit='Unit1' and cast(ts/1000 as TIMESTAMP) BETWEEN '2020-11-29 22:30:00' and '2020-12-01 01:51:00') as ttt) as tttt order by ts
I have read on some forums that LAST_VALUE()
sometimes cause problem, so I tried to achieve the same thing using FIRST_VALUE
with ORDER BY DESC
. The query is given below:
select distinct * from ( select ts, first_value(Val1) over (partition by ts order by ts desc rows between unbounded preceding and unbounded following) as Val1, first_value(Val2) over (partition by ts order by ts desc rows between unbounded preceding and unbounded following) as Val2, first_value(Val3) over (partition by ts order by ts desc rows between unbounded preceding and unbounded following) as Val3 from (SELECT cast(cast(unix_timestamp(cast(ts/1000 as TIMESTAMP))/10 as bigint)*10 as TIMESTAMP) as ts , Val1 as Val1, val2 as Val2, Val3 as Val3 FROM product_sofcdtw_ops.as_operated_full_backup where unit='FCS05-09' and cast(ts/1000 as TIMESTAMP) BETWEEN '2020-11-29 22:30:00' and '2020-12-01 01:51:00') as ttt) as tttt order by ts
But in both cases, I am not getting the result as expected. The resampled time ts
appeared as expected (with a window of 10Sec) but I am getting random values for Val1
, Val2
and Val3
between 0-9sec, 10-19Sec, … windows.
Logic wise this query looks fine and I didnÄt find any problem. Could anybody explain that why I am not getting the right answer using this query.
Thanks !!!
Advertisement
Answer
The problem is this line:
last_value(Val1) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val1,
You are partitioning and ordering by the same column, ts
— so there is no ordering (or more specifically ordering by a value that is constant throughout the partition results in an arbitrary ordering). You need to preserve the original ts to make this work, using that for ordering:
select ts, last_value(Val1) over (partition by ts_10 order by ts rows between unbounded preceding and unbounded following) as Val1, last_value(Val2) over (partition by ts_10 order by ts rows between unbounded preceding and unbounded following) as Val2, last_value(Val3) over (partition by ts_10 order by ts rows between unbounded preceding and unbounded following) as Val3 from (SELECT cast(cast(unix_timestamp(cast(ts/1000 as TIMESTAMP))/10 as bigint)*10 as TIMESTAMP) as ts_10, t.* FROM Sensor_Data.Table t WHERE unit = 'Unit1' AND cast(ts/1000 as TIMESTAMP) BETWEEN '2020-11-29 22:30:00' and '2020-12-01 01:51:00' ) t
Incidentally, the issue with last_value()
is that it has unexpected behavior when you leave out the window frame (the rows
or range
part of the window function specification).
The issue is that the default specification is range between unbounded preceding and current row
, meaning that last_value()
just picks up the value in the current row.
On the other hand, first_value()
works fine with the default frame. However, both are equivalent if you include an explicit frame.