Skip to content
Advertisement

Impala Last_Value() Not giving result as expected

I have a Table in Impala in which I have time information as Unix-Time (with a frequency of 1 mSec) and information about three variables, like given below:

ts          Val1        Val2        Val3        
1.60669E+12 7541.76     0.55964607  267.1613        
1.60669E+12 7543.04     0.5607262   267.27805       
1.60669E+12 7543.04     0.5607241   267.22308       
1.60669E+12 7543.6797   0.56109643  267.25974       
1.60669E+12 7543.6797   0.56107396  267.30624       
1.60669E+12 7543.6797   0.56170875  267.2643    

I want to resample the data and to get the last value of the new time window. For example, if I want to resample as 10Sec frequency the output should be the last value of 10Sec window, like given below:

ts                      val1_Last       Val2_Last       Val3_Last   
2020-11-29 22:30:00     7541.76         0.55964607      267.1613
2020-11-29 22:30:10     7542.3994       0.5613486       267.31238
2020-11-29 22:30:20     7542.3994       0.5601791       267.22842
2020-11-29 22:30:30     7544.32         0.56069416      267.20248

To have this result, I am running the following query:

select distinct *
from (
select ts,
last_value(Val1) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val1, 
last_value(Val2) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val2,
last_value(Val3) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val3 
from (SELECT cast(cast(unix_timestamp(cast(ts/1000 as TIMESTAMP))/10 as bigint)*10 as TIMESTAMP) as ts , 
Val1 as Val1, 
Val2 as Val2, 
Val3 as Val3
FROM Sensor_Data.Table where unit='Unit1'  
and cast(ts/1000 as TIMESTAMP) BETWEEN '2020-11-29 22:30:00' and '2020-12-01 01:51:00') as ttt) as tttt 
order by ts

I have read on some forums that LAST_VALUE() sometimes cause problem, so I tried to achieve the same thing using FIRST_VALUE with ORDER BY DESC. The query is given below:

select distinct *
from (
select ts,
first_value(Val1) over (partition by ts order by ts desc rows between unbounded preceding and unbounded following) as Val1, 
first_value(Val2) over (partition by ts order by ts desc rows between unbounded preceding and unbounded following) as Val2,
first_value(Val3) over (partition by ts order by ts desc rows between unbounded preceding and unbounded following) as Val3
from (SELECT cast(cast(unix_timestamp(cast(ts/1000 as TIMESTAMP))/10 as bigint)*10 as TIMESTAMP) as ts , 
Val1 as Val1, 
val2 as Val2, 
Val3 as Val3
FROM product_sofcdtw_ops.as_operated_full_backup where unit='FCS05-09'  
and cast(ts/1000 as TIMESTAMP) BETWEEN '2020-11-29 22:30:00' and '2020-12-01 01:51:00') as ttt) as tttt 
order by ts

But in both cases, I am not getting the result as expected. The resampled time ts appeared as expected (with a window of 10Sec) but I am getting random values for Val1, Val2 and Val3 between 0-9sec, 10-19Sec, … windows.

Logic wise this query looks fine and I didnÄt find any problem. Could anybody explain that why I am not getting the right answer using this query.

Thanks !!!

Advertisement

Answer

The problem is this line:

last_value(Val1) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val1, 

You are partitioning and ordering by the same column, ts — so there is no ordering (or more specifically ordering by a value that is constant throughout the partition results in an arbitrary ordering). You need to preserve the original ts to make this work, using that for ordering:

select ts,
        last_value(Val1) over (partition by ts_10 order by ts rows between unbounded preceding and unbounded following) as Val1, 
        last_value(Val2) over (partition by ts_10 order by ts rows between unbounded preceding and unbounded following) as Val2,
        last_value(Val3) over (partition by ts_10 order by ts rows between unbounded preceding and unbounded following) as Val3 
from (SELECT cast(cast(unix_timestamp(cast(ts/1000 as TIMESTAMP))/10 as bigint)*10 as TIMESTAMP) as ts_10, 
             t.*
      FROM Sensor_Data.Table t
      WHERE unit = 'Unit1' AND
            cast(ts/1000 as TIMESTAMP) BETWEEN '2020-11-29 22:30:00' and '2020-12-01 01:51:00'
     ) t

Incidentally, the issue with last_value() is that it has unexpected behavior when you leave out the window frame (the rows or range part of the window function specification).

The issue is that the default specification is range between unbounded preceding and current row, meaning that last_value() just picks up the value in the current row.

On the other hand, first_value() works fine with the default frame. However, both are equivalent if you include an explicit frame.

User contributions licensed under: CC BY-SA
9 People found this is helpful
Advertisement