When I join to the right table I am getting way too many duplicates. I am trying to grab the most recent record from the right table however, it does not matter what I try it does not work.
So Far I have tried:
PROC SQL; CREATE TABLE fs1.sample AS SELECT A.*, B.xx1, max(B.time_s) FROM lx1.results a left join (Select Distinct C.id, c.per FROM lx2.results c Where c.id = a.id and COMPGED(a.txt1, c.txt1,'i') < 100 and c.dt > a.dt and c.ksv = 37 and datepart(c.lsg) >= '12DEC2020'd ) b ON a.id = b.id group by a.id, a.txt1 QUIT;
Unfortunately, I get an error. I also tried using case when exists, but that takes way too long. Essentially I am trying to grab the most recent record from the right table based on time_s. I also want to make sure the record I grab from the right table somewhat matches a.txt1.
Cheers
Advertisement
Answer
When you perform a join, you attach all records from the table that match your join conditions.
If the table is indexed appropriately, a subquery could achieve the goal of obtaining the most recent value, however, if the query uses the wrong index, TOP or equivalent functions may return the wrong result.
There are a number of ways to accomplish the task of retrieving the most recent record but they are contingent on a couple of things.
Firstly, you need to be able to identify what the most recent row is, usually by a column called CreatedDate or something similar against the IDs. (You should know what that business logic is, it may be that the table is chronologically entered [as most tables are] and therefore, SubID might be a thing. We’re going to assume it is CreatedDate.)
Secondly, you need to rank the rows in terms of the CreatedDate in a descending order so that the newest matching ID is ranked 1.
Finally, you filter your results by 1 to return the newest result, but you could also filter by <= x if you are interested in the top x newest return results per ID.
To use more mathematical language: We are deriving a value from the CreatedDate and ID values and then using that derivative value to sort and filter the data. In this case we are deriving the RowNumber from the CreatedDate in descending order for each ID.
In order to accomplish this, you can use the Windowed Function ROW_NUMBER(),
ROW_NUMBER() OVER (PARTITION BY id ORDER BY CreatedDate DESC) as RankID
This windowed function will return a row value for each ID relative to the CreatedDate in descending order, where the newest created date is equal to 1.
You can then put brackets around the whole query to make it into a table so you will be able to filter the results of that Windowed Function.
SELECT id, txt (SELECT id, txt ,ROW_NUMBER() OVER (PARTITION BY id ORDER BY CreatedDate DESC) as RankID FROM SourceTable) A WHERE RankID = 1
This should achieve your goal of returning the “newest result”.
What ever your column is that determines the age of the data relative to the ID, it can be multiple, should be placed within the ORDER BY.
In order to make this query perform faster, you should index your data appropriately, whereby ID is the the first column, and CreatedDate Desc is your next column. This means your system will not have to perform a costly sort every time this runs, but that depends on whether you plan on using this query often and how much overhead it is grabbing.