Is there any special reason that SQL only implements FIRST_VALUE and LAST_VALUE as a windowing function instead of an aggregation function? I find it quite common to encounter problems such as “find the item with highest price in each category”. While other languages (such as python) provide MIN/MAX functions with keywords such that
MAX(item_name, key=lambda x: revenue[x])
is possible, In SQL the only way to tackle this problem seems to be:
WITH temp as( SELECT *, FIRST_VALUE(item_name) OVER(PARTITION BY category ORDER BY revenue) as fv FROM catalog) SELECT category, MAX(fv) -- MIN(fv) also OK FROM temp GROUP BY category;
Is there a special reason that there is no “aggregation version” of FIRST_VALUE such that
SELECT category, FIRST_VALUE(item_name, revenue) FROM catalog GROUP BY category
or is it just the way it is?
Advertisement
Answer
That’s just the way it is, as far as I’m concerned. I suspect the only real answer would be “because it’s not in the SQL spec” and the only people who could really answer as to why it’s not in the spec are the people who write it. Questions of the form “what was (name of relevant external authority) thinking when they mandated that (name of product) should operate like this” are actually typically off topic here because very few people can reliably and factually answer.. I don’t even like my own answer here, as it feels like an extended comment on a question that cannot realistically be answered
Aggregate functions work on sets of data and while some of them might require some implied ordering operation such as median, the functions are always about the column they’re operating on, not a “give me the value of this column based on the ordering of that column”.
There are plenty of window/analytic functions that don’t have a corollary aggregation version, and window functions have a different end use intent than aggregation. You could conceive that some of them perform aggregation and then join the aggregation result back to the main data in order to relate the agg result to the particular row, but I wouldn’t assume the two facilities (agg vs window) are related at all
As far as I understand the python (not a python dev), it is not doing any aggregation, it’s searching a list of item_name strings and looking each up in a dictionary that returns the revenue for that item, and returning the item_name that has the largest revenue. There wasn’t any grouping there, it’s much more like a SELECT TOP 1 item_name ORDER BY revenue and is only really good for returning a single item, rather than a load of items that are all maxes within their group, unless it’s used within a loop that is processing a different list of item name each time
I know your question wasn’t exactly about this particular SQL query but it may be helpful for you if I mention a couple of things on it. I’m not really sure what:
WITH temp as( SELECT *, FIRST_VALUE(item_name) OVER(PARTITION BY category ORDER BY revenue) as fv FROM catalog ) SELECT category, MAX(fv) -- MIN(fv) also OK FROM temp GROUP BY category;
Gives you over something like:
SELECT DISTINCT category, FIRST_VALUE(item_name) OVER(PARTITION BY category ORDER BY revenue) as fv FROM catalog
The analytic/window will produce the same value for every category (the partition) so it seems that really all the extra group by is doing is reducing the repeated values – which could be more simply answered by just getting the values you want and using distinct to quash the duplicates (one of the few cases where I would advocate such)
In the more general sense of “I want the entire most X row as determined by highest/lowest Y” we typically use row number for that:
WITH temp as( SELECT *, ROW_NUMBER(item_name) OVER(PARTITION BY category ORDER BY revenue) as rn FROM catalog) SELECT * FROM temp WHERE rn = 1;
Though I find it more compact/readable to dispense with the CTE and just use a sub query but YMMV