I intend to interpolate (linear interpolation) between values in a column and insert that into a new column using a SQL query. Based on my search online, I suspect LEAD analytic function could be useful. I am new to writing SQL queries. So, any insights on how it can be achieved will be quite helpful.
The sample data set is as described:
Emp Test_date Value --- --------- ----- A 1/1/2001 null A 1/2/2001 100 A 1/3/2001 null A 1/4/2001 80 A 1/5/2001 null A 1/6/2001 null A 1/7/2001 75
The idea is to get a fourth column that has values as:
null 100 interpolatedValue1 80 interpolatedValue2 interpolatedValue3 75
Interpolatedvalue1 would be an interpolated value between 100 and 80,
Interpolatedvalue2 would be a linearly interpolated value between 80 and 75.
InterpolatedValue3 would be a linearly interpolated value between Interpolatedvalue2 and 75
Here’s how a simple linear interpolation works:
Given two points (V1 at D1), (V3 at D3). What would be the value V2 at D2?
(V3-V1)/(D3-D1) * (D2-D1) + V1
Advertisement
Answer
This can probably be simplified a bit but gets the answer you wanted, I believe. The slightly tricky bit is getting both the number of days between not-null values (i.e. the size of the gap you’re filling) and then the position within that gap:
-- CTE for sample data with your_table (emp, test_date, value) as ( select 'A', date '2001-01-01', null from dual union all select 'A', date '2001-01-02', 100 from dual union all select 'A', date '2001-01-03', null from dual union all select 'A', date '2001-01-04', 80 from dual union all select 'A', date '2001-01-05', null from dual union all select 'A', date '2001-01-06', null from dual union all select 'A', date '2001-01-07', 75 from dual ) -- actual query select emp, test_date, value, coalesce(value, (next_value - prev_value) -- v3-v1 / (count(*) over (partition by grp) + 1) -- d3-d1 * row_number() over (partition by grp order by test_date desc) -- d2-d1, indirectly + prev_value -- v1 ) as interpolated from ( select emp, test_date, value, last_value(value ignore nulls) over (partition by emp order by test_date) as prev_value, first_value(value ignore nulls) over (partition by emp order by test_date range between current row and unbounded following) as next_value, row_number() over (partition by emp order by test_date) - row_number() over (partition by emp order by case when value is null then 1 else 0 end, test_date) as grp from your_table ) order by test_date;
E TEST_DATE VALUE INTERPOLATED - ---------- ---------- ------------ A 2001-01-01 A 2001-01-02 100 100 A 2001-01-03 90 A 2001-01-04 80 80 A 2001-01-05 76.6666667 A 2001-01-06 78.3333333 A 2001-01-07 75 75
I’ve used last_value
and first_value
instead of lead
and lag
, but either works. (Lead/lag might be faster on a large data set I suppose). The grp
calculation is Tabibitosan.