I have a query where I’m trying to select the most recent row based on date. So what I would like my query to do is to select the most recent datapoints for a couple columns in my database.
sql = f"""SELECT DISTINCT ON (id), cast(marketcap as money), cast(week52high as money), cast(week52low as money) , to_char(dividend_yield * 100, '99D99%%') , pe_ratio, ROUND(beta,2) FROM security_stats WHERE security_stats.id IN ({placeholders}); """
For anyone wondering, placeholders
is just a list of stocks given by the user.
The goal of this application is to retrieve the most recent observations for financial data points. My query works fine but instead it selects all of the rows per stock, when I just want the the most recent row, per stock.
How can I edit my query to fix the above? I have tried some methods using MAX
and LIMIT
but couldn’t get it to work!.
EDIT:
Note that the columns working with are id
and date
. These should be distinct, and the date should be the most recent observation.
Got it to work thanks to @GME, using this method below. Special thanks too @ Somy for showing me another method as well!
sql = f"""SELECT DISTINCT ON (id) id, cast(marketcap as money), cast(week52high as money), cast(week52low as money) , to_char(dividend_yield * 100, '99D99%%') , pe_ratio, ROUND(beta,2) FROM security_stats WHERE security_stats.id IN ({placeholders}) ORDER BY id, date desc ; """
Advertisement
Answer
The column (or set of columns) in the distinct on
clause defines the groups: if you want one row per “stock”, then column stock should belong to that clause.
Then, you need an order by
: it starts with the distinct on
column(s), followed by one or more columns that define which row should be kept in each group: this is where the “date” column goes.
Your question does not explain what are the names of these to crucial columns in the table. Assuming that we have id
and date
, the logic would be:
SELECT DISTINCT ON (id) ... FROM security_stats WHERE ... ORDER BY id, date desc