This is a follow-up to this question, where my query was improved to use window functions instead of aggregates inside a LATERAL
join. While the query is now much faster, I’ve found that the results are not correct.
I need to perform computations on x year trailing time frames. For example, price_to_maximum_earnings
is computed per row by getting max(earnings)
over ten years ago to the current row, and dividing price
by the result. We’ll use 1 year for simplicity here.
SQL Fiddle for this question. (Postgres 9.6)
As a simple example, price
and peak_earnings
for 2010-01-01
could be computed separately like this:
SELECT price FROM security_data WHERE date = '2010-01-01' AND security_id = 'SPX'; SELECT max(earnings) AS min_earnings FROM bloomberg.security_data WHERE date >= '2000-01-01' AND date <= '2010-01-01' AND security_id = 'SPX';
To do this per row, I use the following:
SELECT security_id, date, price , CASE WHEN date1 >= min_date THEN price / NULLIF(max(earnings) FILTER (WHERE date >= date1) OVER w, 0) END AS price_to_peak_earnings FROM ( SELECT record_id, security_id, price, date, earnings , (date - interval '1 y')::date AS date1 , min(date) OVER (PARTITION BY security_id) AS min_date FROM security_data ) d WINDOW w AS (PARTITION BY security_id);
I believe the issue here stems from the use of FILTER
, as it doesn’t seem to be working as I want it to. Note that in the linked SQL Fiddle, I’ve displayed the result of the FILTER
, and for each row the peak_earnings
and minimum_earnings
are just the max and min for the entire data set. They should be the max/min values of earnings
from 1 year ago to the current row.
What’s going on here? I know from the answer to this question that I can’t simply say FILTER (WHERE date >= date1 AND date <= current_row.date)
, so is there a solution I’m missing? I cannot use window frames because I have an uncertain number of rows for any given time frame, so I couldn’t just say OVER (ROWS BETWEEN 365 PRECEDING AND CURRENT ROW)
. Can I use a frame and a filter? That might get over a year previous, then the filter could catch every invalid date. I’ve tried this but have not been successful.
Advertisement
Answer
Can I use a frame and a filter?
You can. But either has restrictions:
The expression in the
FILTER
clause only sees the respective row where it fetches values. There is no way to reference the row for which your window function computes values. So I don’t see a way to formulate a filter depending on that row unless we make a huge, expensive cross join – the same row is used for many different computations. Or we are back toLATERAL
subqueries that can reference the parent row.The frame definition on the other hand does not allow variables at all. It demands a fixed number, as discussed in the related answer you referenced:
These restrictions make your particular query hard to implement. This should be correct now:
SELECT * FROM ( SELECT record_id, security_id, date, price , CASE WHEN do_calc THEN max(earnings) OVER w1 END AS peak_earnings , CASE WHEN do_calc THEN min(earnings) OVER w1 END AS minimum_earnings , CASE WHEN do_calc THEN price / NULLIF(max(earnings) OVER w1, 0) END AS price_to_peak_earnings , CASE WHEN do_calc THEN price / NULLIF(min(earnings) OVER w1, 0) END AS price_to_minimum_earnings FROM ( SELECT *, (date - 365) >= min_date AND s.record_id IS NOT NULL AS do_calc FROM ( SELECT security_id, min_date , generate_series(min_date, max_date, interval '1 day')::date AS date FROM ( SELECT security_id, min(date) AS min_date, max(date) AS max_date FROM security_data GROUP BY 1 ) minmax ) d LEFT JOIN security_data s USING (security_id, date) ) sub1 WINDOW w1 AS (PARTITION BY security_id ORDER BY date ROWS BETWEEN 365 PRECEDING AND 1 PRECEDING) ) sub2 WHERE record_id IS NOT NULL ORDER BY 1, 2;
Notes
Nothing in the question says that every
security_id
would have rows for the same days. Calculating min / max date persecurity_id
in subqueryminmax
give us the minimum time frame.The time frame for calculations is exactly 365 day preceding the current date of the row and not including the current row (
ROWS BETWEEN 365 PRECEDING AND 1 PRECEDING
). It’s typically more useful to exclude the current row from aggregations to be compared with the current row.
I adapted the condition for calculations to the same time frame to avoid corner case oddities:(date - 365) >= min_date
In the fiddle, where you added 1 row for every 1st of Jan, you can see the effect of leapyears contrasting with a fixed number of 365 day. The window frame is empty after leapyears (2001, 2005, …).
I am using all subqueries, which is typically a bit faster than CTEs.
To be sure, we need to include
ORDER BY
in the frame definition. I updated my old answer you linked to accordingly:I use
w1
as window name, for the “1 year” period. You might addw2
, etc. and can have any number of days for each. You could adapt to leapyears after all if you should need to. Might even generate the whole query depending on the current date …