We have a table of expected payments and a table of payments made. We need to be able to match payments with a singular expected payment but we allow a +-3 day window for it to be made. On top of that, it should be a one-to-one match.
So imagine I have a table of expected payments
2020-10-01 2020-10-04 2020-10-05 2020-10-20
and payments
2020-10-02 2020-10-06 2020-10-07
The result I want is
Expected Made 2020-10-01 2020-10-02 2020-10-04 2020-10-06 2020-10-05 2020-10-07 2020-10-20
and if the payment of the 6th is removed the result would be
Expected Made 2020-10-01 2020-10-02 2020-10-04 2020-10-07 2020-10-05 2020-10-20
so the matching between the 5th and the payment on the 7th depends on whether the payment matched with the 4th. As does the matching between the 4th and 7th depend on if the 4th was matched with the 6th.
I’ve currently achieved this by doing a full join with the matching and then recursively iterating over it to clean out repeated records from both sides. unfortunately as the data in this case is in the 100s of millions of rows, it takes about 40 minutes to churn.
I’m wondering if there is a better way or a built in join that I’ve not come across to achieve this concept of distinct matching.
Advertisement
Answer
..the same logic, as your current approach, iteration through expected dates and find any matching payment date which is greater than the last used payment date. It is a contrived example (based on the simplified requirement) and it might not fit well to the complete/full specs (lots of intricacies are hidden underneath the uniqueness of the dates.. do not get tricked just because the example executes fast).
/* drop table if exists expected go drop table if exists payment go */ --expected create table expected(expd date primary key clustered) go declare @d date = isnull((select max(expd) from expected), '20100101') begin try insert into expected(expd) select dateadd(day, abs(checksum(newid()))%5, @d) end try begin catch --ignore errors, .... just a couple sample rows end catch go 5000 ---------------- --payment create table payment(payd date primary key clustered) go declare @d date = isnull((select max(payd) from payment), '20100101') begin try insert into payment(payd) select dateadd(day, abs(checksum(newid()))%10, @d) end try begin catch --ignore errors, .... just a couple sample rows end catch go 5000 --max recursion == number of rows in expected - 1 (1 is the anchor) declare @maxrecursion int = (select count(*) from expected)-1 /* -1==anchor*/; --maximum difference between consecutive days in expected .. --this is not needed if there is sequential column (with no breaks) in expected which could be used for the recursion --eg rowA has id=1, rowB has id=2, rowC has id=3....rowX has id=100...recursion = recursive id + 1 declare @maxexpdaysdiff int = ( select max(daysdiff) from ( select datediff(day, expd, lead(expd, 1, expd) over(order by expd)) as daysdiff from expected ) as ed ); --recursion, rows of expected with cte as ( --mimimum expected date and any matching payment date select top (1) e.expd, lead(e.expd, 1) over(order by e.expd) as recexpd, pp.payd, pp.payd as lastusedpayd, 1 as rownum from expected as e outer apply(select top (1) p.payd from payment as p where p.payd >= e.expd and p.payd < dateadd(day, 3+1, e.expd) order by p.payd) as pp order by e.expd union all --recursive part (next row of expected and any matching payment which was not assigned as lastusedpayd select cte.recexpd, lexp.leadexpd, mp.payd, isnull(mp.payd, cte.lastusedpayd), cte.rownum+1 from cte --cte carries next expected date::recexpd, this is not needed if there is another/sequential column to be used --next expected row/date for the recursion outer apply ( --get the next expected date, after cte.recexpd to continue the recursion (like using lead() in recursion) select l.leadexpd from ( --rows between cte.recexpd and cte.recexpd+@maxexpdaysdiff (there is always a row..except for the last expected date) select e.expd as leadexpd, row_number() over (order by e.expd) as ern from expected as e where e.expd > cte.recexpd and e.expd <= dateadd(day, @maxexpdaysdiff, cte.recexpd) ) as l where l.ern = 1 ) as lexp --matching payment, if any outer apply ( select lmp.payd from ( --payments between cte.recexpd and cte.recexpd + 3 days --..but exclude all payments before the payment which was lastly used:: cte.lastusedpayd select p.payd, row_number() over(order by p.payd) as prn from payment as p where p.payd >= cte.recexpd and p.payd < dateadd(day, 3+1, cte.recexpd) and (p.payd > cte.lastusedpayd or cte.lastusedpayd is null) ) as lmp where lmp.prn=1 ) as mp where cte.rownum <= @maxrecursion--safeguard recursion. ) select expd, payd --,* from cte option(maxrecursion 0);
EDIT : Another approach (educational), would be to store the payment date which was assigned/selected for a previous row/expected date and use that value when selecting the payment date for each new expected date.
In short:
SELECT expecteddate –> SELECT paymentdate WHERE paymentdate > last_used_payment_date –> SET last_used_payment_date = SELECTed paymentdate (if any)
Although the logic is identical to the recursive approach, it can be implemented without an explicit tsql recursive cte, but just with a normal select and a scalar function.
If you are familiar with (or ever heard of) the “quirky update” then the following could potentially be a “quirky select”.
--the scalar function create or alter function dbo.lastusedpaymentdate(@date date = null) returns date as begin if @date is null begin return (cast(session_context(N'paymentdate') as date)); end exec sp_set_session_context @key=N'paymentdate', @value=@date; return(@date); end go --the query: --reset session variable for each execution <-- this is not really needed for a single execution of the select in a session exec sp_set_session_context @key=N'paymentdate', @value=null; --..a select select e.expd, p.payd, dbo.lastusedpaymentdate(p.payd) as lastusedpaymentdate from expected as e with(index(1)) outer apply ( select top (1) pm.payd from payment as pm where pm.payd >= e.expd and pm.payd < dateadd(day, 3+1, e.expd) and pm.payd > isnull(dbo.lastusedpaymentdate(default), '19000101') order by pm.payd ) as p order by e.expd;
select e.expd from expected
⇛ SELECT expecteddateouter apply(..pm.payd > isnull(dbo.lastusedpaymentdate(default), '19000101'))
⇛ SELECT paymentdate WHERE paymentdate > last_used_payment_dateselect ..dbo.lastusedpaymentdate(p.payd)..
⇛ SET last_used_payment_date = SELECTed paymentdate (if any)
The quirkiness stems from the fact that the base/main table (i.e. expected dates) must be traversed in order, according to the business logic (earliest date first and all other dates next in order). This goes against the SQL fundamentals (declare what, not how) and order of execution cannot “be assumed/taken for granted”.
For the simple example, the query selects with an apply(), the main table has a clustered index to presumably enforce order of retrieval..but still order of execution cannot be taken as ‘always guaranteed’. A more defensive approach would be to throw everything at the query…in an attempt to stop it from deviating from its “expected” behavior (as an exercise in futility) :
............... from expected as e with(index(1), forcescan) outer apply ( .......... ) as p order by e.expd option(force order, maxdop 1, use hint ('DISABLE_BATCH_MODE_ADAPTIVE_JOINS'));
Forcing an execution plan might be the best option if one had to exploit a volatile, ever changing, session context key.
Nevertheless, the focal point of the approach is the simple scalar function, which stores&returns its parameter value in a session key when called with a non-null parameter or it (the function) returns the value of that session key when it’s called with a NULL parameter. For the retrieval of the payment date in the outer apply, the function is called with default/NULL (session key is not set), while in the select part, the function is called with p.payd as parameter. p.payd derives from an outer apply, if p.payd has a value (==there is a matching payment date) then the scalar function sets the session key, if there is no matching p.payd==NULL, then the scalar function simply returns the value of the session key –> synergy among simplistic tsql parts.