Skip to content
Advertisement

Window functions filter through current row

This is a follow-up to this question, where my query was improved to use window functions instead of aggregates inside a LATERAL join. While the query is now much faster, I’ve found that the results are not correct.

I need to perform computations on x year trailing time frames. For example, price_to_maximum_earnings is computed per row by getting max(earnings) over ten years ago to the current row, and dividing price by the result. We’ll use 1 year for simplicity here.

SQL Fiddle for this question. (Postgres 9.6)

As a simple example, price and peak_earnings for 2010-01-01 could be computed separately like this:

SELECT price
FROM security_data
WHERE date = '2010-01-01'
AND security_id = 'SPX';

SELECT max(earnings) AS min_earnings
FROM bloomberg.security_data
WHERE date >= '2000-01-01'
AND date <= '2010-01-01'
AND security_id = 'SPX';

To do this per row, I use the following:

SELECT security_id, date, price
     , CASE WHEN date1 >= min_date
            THEN price / NULLIF(max(earnings) FILTER (WHERE date >= date1) OVER w, 0) END AS price_to_peak_earnings
FROM
(
  SELECT record_id, security_id, price, date, earnings
           , (date - interval '1 y')::date AS date1
           , min(date) OVER (PARTITION BY security_id) AS min_date
      FROM   security_data
) d
WINDOW w AS (PARTITION BY security_id);

I believe the issue here stems from the use of FILTER, as it doesn’t seem to be working as I want it to. Note that in the linked SQL Fiddle, I’ve displayed the result of the FILTER, and for each row the peak_earnings and minimum_earnings are just the max and min for the entire data set. They should be the max/min values of earnings from 1 year ago to the current row.

What’s going on here? I know from the answer to this question that I can’t simply say FILTER (WHERE date >= date1 AND date <= current_row.date), so is there a solution I’m missing? I cannot use window frames because I have an uncertain number of rows for any given time frame, so I couldn’t just say OVER (ROWS BETWEEN 365 PRECEDING AND CURRENT ROW). Can I use a frame and a filter? That might get over a year previous, then the filter could catch every invalid date. I’ve tried this but have not been successful.

Advertisement

Answer

Can I use a frame and a filter?

You can. But either has restrictions:

  • The expression in the FILTER clause only sees the respective row where it fetches values. There is no way to reference the row for which your window function computes values. So I don’t see a way to formulate a filter depending on that row unless we make a huge, expensive cross join – the same row is used for many different computations. Or we are back to LATERAL subqueries that can reference the parent row.

  • The frame definition on the other hand does not allow variables at all. It demands a fixed number, as discussed in the related answer you referenced:

These restrictions make your particular query hard to implement. This should be correct now:

SELECT *
FROM  (
   SELECT record_id, security_id, date, price
        , CASE WHEN do_calc THEN                max(earnings) OVER w1     END AS peak_earnings
        , CASE WHEN do_calc THEN                min(earnings) OVER w1     END AS minimum_earnings
        , CASE WHEN do_calc THEN price / NULLIF(max(earnings) OVER w1, 0) END AS price_to_peak_earnings
        , CASE WHEN do_calc THEN price / NULLIF(min(earnings) OVER w1, 0) END AS price_to_minimum_earnings
   FROM  (
      SELECT *, (date - 365) >= min_date AND s.record_id IS NOT NULL AS do_calc
      FROM  (
         SELECT security_id, min_date
              , generate_series(min_date, max_date, interval '1 day')::date AS date
         FROM  (
            SELECT security_id, min(date) AS min_date, max(date) AS max_date
            FROM   security_data
            GROUP  BY 1
            ) minmax
         ) d
      LEFT   JOIN  security_data s USING (security_id, date)
      ) sub1
   WINDOW w1 AS (PARTITION BY security_id ORDER BY date ROWS BETWEEN 365 PRECEDING AND 1 PRECEDING)
   ) sub2
WHERE  record_id IS NOT NULL 
ORDER  BY 1, 2;

SQL Fiddle.

Notes

  • Nothing in the question says that every security_id would have rows for the same days. Calculating min / max date per security_id in subquery minmax give us the minimum time frame.

  • The time frame for calculations is exactly 365 day preceding the current date of the row and not including the current row (ROWS BETWEEN 365 PRECEDING AND 1 PRECEDING). It’s typically more useful to exclude the current row from aggregations to be compared with the current row.
    I adapted the condition for calculations to the same time frame to avoid corner case oddities: (date - 365) >= min_date

  • In the fiddle, where you added 1 row for every 1st of Jan, you can see the effect of leapyears contrasting with a fixed number of 365 day. The window frame is empty after leapyears (2001, 2005, …).

  • I am using all subqueries, which is typically a bit faster than CTEs.

  • To be sure, we need to include ORDER BY in the frame definition. I updated my old answer you linked to accordingly:

  • I use w1 as window name, for the “1 year” period. You might add w2, etc. and can have any number of days for each. You could adapt to leapyears after all if you should need to. Might even generate the whole query depending on the current date …

Advertisement