I have a sql table that stores the daily prices of stocks. New records are inserted every day after the market closes. I want to find the stocks that have consecutive increases in price.
The table has lots of columns, but this is the relevant subset:
quoteid stockid closeprice createdate -------------------------------------------------- 1 1 1 01/01/2012 2 2 10 01/01/2012 3 3 15 01/01/2012 4 1 2 01/02/2012 5 2 11 01/02/2012 6 3 13 01/02/2012 7 1 5 01/03/2012 8 2 13 01/03/2012 9 3 17 01/03/2012 10 1 7 01/04/2012 11 2 14 01/04/2012 12 3 18 01/04/2012 13 1 9 01/05/2012 14 2 11 01/05/2012 15 3 10 01/05/2012
The quoteid
column is a primary key.
In the table, the closing price of stock id 1 increases every day. Stock id 3 fluctuates a lot, and the price for stock id 2 fell on the last day.
I am looking for a result like this:
stockid Consecutive Count (CC) ---------------------------------- 1 5 2 4
If you can get output with dates for the consecutive streak, that would even better:
stockid Consecutive Count (CC) StartDate EndDate --------------------------------------------------------------- 1 5 01/01/2012 01/05/2012 2 4 01/01/2012 01/04/2012
StartDate
is when the price started increasing and EndDate
is when the bull run actually finished.
I have figured this is not an easy problem. I have looked at other posts here which also deal with this consecutive scenario but they don’t fit my needs. If you know any post that is similar to mine, please do let me know.
Advertisement
Answer
In any case, it helps to put it in terms of increasing rows-per-stock (the actual quoteid
value isn’t really helpful here). Count of days captured (in this table) is easiest – if you want something else (like only business days, ignoring weekends/holidays, or whatever) it gets more involved; you’d probably need a calendar file. You’re going to want an index over [stockid
, createdate
], if you don’t have one already.
WITH StockRow AS (SELECT stockId, closePrice, createdDate, ROW_NUMBER() OVER(PARTITION BY stockId ORDER BY createdDate) rn FROM Quote), RunGroup AS (SELECT Base.stockId, Base.createdDate, MAX(Restart.rn) OVER(PARTITION BY Base.stockId ORDER BY Base.createdDate) groupingId FROM StockRow Base LEFT JOIN StockRow Restart ON Restart.stockId = Base.stockId AND Restart.rn = Base.rn - 1 AND Restart.closePrice > Base.closePrice) SELECT stockId, COUNT(*) AS consecutiveCount, MIN(createdDate) AS startDate, MAX(createdDate) AS endDate FROM RunGroup GROUP BY stockId, groupingId HAVING COUNT(*) >= 3 ORDER BY stockId, startDate
Which yields the following results from the provided data:
Increasing_Run stockId consecutiveCount startDate endDate =================================================== 1 5 2012-01-01 2012-01-05 2 4 2012-01-01 2012-01-04 3 3 2012-01-02 2012-01-04
SQL Fiddle Example
(Fiddle also has an example for multiple runs)
This analysis will ignore all gaps, correctly matches all runs (the next time a positive run starts).
So what’s going on here?
StockRow AS (SELECT stockId, closePrice, createdDate, ROW_NUMBER() OVER(PARTITION BY stockId ORDER BY createdDate) rn FROM Quote)
This CTE is being used for one purpose: we need a way to find the next/previous row, so first we number each row in order (of the date)…
RunGroup AS (SELECT Base.stockId, Base.createdDate, MAX(Restart.rn) OVER(PARTITION BY Base.stockId ORDER BY Base.createdDate) groupingId FROM StockRow Base LEFT JOIN StockRow Restart ON Restart.stockId = Base.stockId AND Restart.rn = Base.rn - 1 AND Restart.closePrice > Base.closePrice)
… and then join them based on the index. If you end up on something that has LAG()
/LEAD()
, using those instead will almost certainly be a better option. There’s one critical thing here though – matches are only if the row is out-of-sequence (less than the previous row). Otherwise, the value end up being null
(with LAG()
, you’d need to use something like CASE
afterwards to pull this off). You get a temporary set that looks something like this:
B.rn B.closePrice B.createdDate R.rn R.closePrice R.createdDate groupingId 1 15 2012-01-01 - - - - 2 13 2012-01-02 1 15 2012-01-01 1 3 17 2012-01-03 - - - 1 4 18 2012-01-04 - - - 1 5 10 2012-01-05 4 18 2012-01-04 4
… So there’s values for Restart
only when the previous was greater than the “current” row. The use of MAX()
in the window function is being used to the greatest value seen so far… which because null
is lowest, causes the row-index to be retained for all other rows until another mismatch occurs (which gives a new value). At this point, we essentially have the intermediate results of a gaps-and-islands query, ready for the final aggregation.
SELECT stockId, COUNT(*) AS consecutiveCount, MIN(createdDate) AS startDate, MAX(createdDate) AS endDate FROM RunGroup GROUP BY stockId, groupingId HAVING COUNT(*) >= 3 ORDER BY stockId, startDate
The final part of the query is getting the start and end dates of the run, and counting the number of entries between those dates. If there was something more complicated for the date calculation, it probably needs to happen at this point. The GROUP BY
is showing one of the few legitimate instances of not including a column in the SELECT
clause. The HAVING
clause is used to eliminate runs that are “too short”.