Skip to content
Advertisement

Full Join using a fuzzy join but with distinct matching

We have a table of expected payments and a table of payments made. We need to be able to match payments with a singular expected payment but we allow a +-3 day window for it to be made. On top of that, it should be a one-to-one match.

So imagine I have a table of expected payments

2020-10-01
2020-10-04
2020-10-05
2020-10-20

and payments

2020-10-02
2020-10-06
2020-10-07

The result I want is

Expected      Made
2020-10-01    2020-10-02
2020-10-04    2020-10-06
2020-10-05    2020-10-07
2020-10-20

and if the payment of the 6th is removed the result would be

Expected      Made
2020-10-01    2020-10-02
2020-10-04    2020-10-07
2020-10-05
2020-10-20

so the matching between the 5th and the payment on the 7th depends on whether the payment matched with the 4th. As does the matching between the 4th and 7th depend on if the 4th was matched with the 6th.

I’ve currently achieved this by doing a full join with the matching and then recursively iterating over it to clean out repeated records from both sides. unfortunately as the data in this case is in the 100s of millions of rows, it takes about 40 minutes to churn.

I’m wondering if there is a better way or a built in join that I’ve not come across to achieve this concept of distinct matching.

Advertisement

Answer

..the same logic, as your current approach, iteration through expected dates and find any matching payment date which is greater than the last used payment date. It is a contrived example (based on the simplified requirement) and it might not fit well to the complete/full specs (lots of intricacies are hidden underneath the uniqueness of the dates.. do not get tricked just because the example executes fast).

/*
drop table if exists expected
go
drop table if exists payment
go
*/


--expected
create table expected(expd date primary key clustered)
go

declare @d date = isnull((select max(expd) from expected), '20100101')
begin try
    insert into expected(expd)
    select dateadd(day, abs(checksum(newid()))%5, @d)
end try
begin catch
    --ignore errors, .... just a couple sample rows
end catch
go 5000
----------------

--payment
create table payment(payd date primary key clustered)
go

declare @d date = isnull((select max(payd) from payment), '20100101')
begin try
    insert into payment(payd)
    select dateadd(day, abs(checksum(newid()))%10, @d)
end try
begin catch
    --ignore errors, .... just a couple sample rows
end catch
go 5000



--max recursion == number of rows in expected - 1 (1 is the anchor)
declare @maxrecursion int = (select count(*) from expected)-1 /* -1==anchor*/;

--maximum difference between consecutive days in expected ..
--this is not needed if there is sequential column (with no breaks) in expected which could be used for the recursion
--eg rowA has id=1, rowB has id=2, rowC has id=3....rowX has id=100...recursion = recursive id + 1

declare @maxexpdaysdiff int = (
    select max(daysdiff)
    from 
    ( 
        select datediff(day, expd, lead(expd, 1, expd) over(order by expd)) as daysdiff
        from expected
    ) as ed
);



--recursion, rows of expected
with cte
as
(
    --mimimum expected date and any matching payment date
    select top (1) e.expd, lead(e.expd, 1) over(order by e.expd) as recexpd, pp.payd, pp.payd as lastusedpayd, 1 as rownum
    from expected as e
    outer apply(select top (1) p.payd from payment as p where p.payd >= e.expd and p.payd < dateadd(day, 3+1, e.expd) order by p.payd) as pp
    order by e.expd 
    union all
    --recursive part (next row of expected and any matching payment which was not assigned as lastusedpayd
    select cte.recexpd, lexp.leadexpd, mp.payd, isnull(mp.payd, cte.lastusedpayd), cte.rownum+1
    from cte --cte carries next expected date::recexpd, this is not needed if there is another/sequential column to be used
    --next expected row/date for the recursion
    outer apply
    (
        --get the next expected date, after cte.recexpd to continue the recursion (like using lead() in recursion)
        select l.leadexpd
        from
        (
            --rows between cte.recexpd and cte.recexpd+@maxexpdaysdiff (there is always a row..except for the last expected date)
            select e.expd as leadexpd, row_number() over (order by e.expd) as ern
            from expected as e
            where e.expd > cte.recexpd
            and e.expd <= dateadd(day, @maxexpdaysdiff, cte.recexpd)
        ) as l 
        where l.ern = 1
    ) as lexp
    --matching payment, if any
    outer apply
    ( 
        select lmp.payd
        from
        (
            --payments between cte.recexpd and cte.recexpd + 3 days
            --..but exclude all payments before the payment which was lastly used:: cte.lastusedpayd
            select p.payd, row_number() over(order by p.payd) as prn
            from payment as p
            where p.payd >= cte.recexpd
            and p.payd < dateadd(day, 3+1, cte.recexpd)
            and (p.payd > cte.lastusedpayd or cte.lastusedpayd is null)
        
        ) as lmp
        where lmp.prn=1
    ) as mp
    where cte.rownum <=  @maxrecursion--safeguard recursion.    
)
select expd, payd --,*
from cte
option(maxrecursion 0);

EDIT : Another approach (educational), would be to store the payment date which was assigned/selected for a previous row/expected date and use that value when selecting the payment date for each new expected date.

In short:

SELECT expecteddate –> SELECT paymentdate WHERE paymentdate > last_used_payment_date –> SET last_used_payment_date = SELECTed paymentdate (if any)

Although the logic is identical to the recursive approach, it can be implemented without an explicit tsql recursive cte, but just with a normal select and a scalar function.

If you are familiar with (or ever heard of) the “quirky update” then the following could potentially be a “quirky select”.

--the scalar function
create or alter function dbo.lastusedpaymentdate(@date date = null)
returns date
as
begin
    if @date is null
    begin
        return (cast(session_context(N'paymentdate') as date)); 
    end

    exec sp_set_session_context @key=N'paymentdate', @value=@date;
    return(@date);
end
go

--the query:
--reset session variable for each execution <-- this is not really needed for a single execution of the select in a session
exec sp_set_session_context @key=N'paymentdate', @value=null;
--..a select
select e.expd, p.payd, dbo.lastusedpaymentdate(p.payd) as lastusedpaymentdate
from expected as e with(index(1))
outer apply
(
    select top (1) pm.payd
    from payment as pm
    where pm.payd >= e.expd
    and pm.payd < dateadd(day, 3+1, e.expd)
    and pm.payd > isnull(dbo.lastusedpaymentdate(default), '19000101')
    order by pm.payd
) as p
order by e.expd;
  1. select e.expd from expected ⇛ SELECT expecteddate
  2. outer apply(..pm.payd > isnull(dbo.lastusedpaymentdate(default), '19000101')) ⇛ SELECT paymentdate WHERE paymentdate > last_used_payment_date
  3. select ..dbo.lastusedpaymentdate(p.payd).. ⇛ SET last_used_payment_date = SELECTed paymentdate (if any)

The quirkiness stems from the fact that the base/main table (i.e. expected dates) must be traversed in order, according to the business logic (earliest date first and all other dates next in order). This goes against the SQL fundamentals (declare what, not how) and order of execution cannot “be assumed/taken for granted”.

For the simple example, the query selects with an apply(), the main table has a clustered index to presumably enforce order of retrieval..but still order of execution cannot be taken as ‘always guaranteed’. A more defensive approach would be to throw everything at the query…in an attempt to stop it from deviating from its “expected” behavior (as an exercise in futility) :

...............
from expected as e with(index(1), forcescan)
outer apply
(
..........
) as p
order by e.expd
option(force order, maxdop 1, use hint ('DISABLE_BATCH_MODE_ADAPTIVE_JOINS'));

Forcing an execution plan might be the best option if one had to exploit a volatile, ever changing, session context key.

Nevertheless, the focal point of the approach is the simple scalar function, which stores&returns its parameter value in a session key when called with a non-null parameter or it (the function) returns the value of that session key when it’s called with a NULL parameter. For the retrieval of the payment date in the outer apply, the function is called with default/NULL (session key is not set), while in the select part, the function is called with p.payd as parameter. p.payd derives from an outer apply, if p.payd has a value (==there is a matching payment date) then the scalar function sets the session key, if there is no matching p.payd==NULL, then the scalar function simply returns the value of the session key –> synergy among simplistic tsql parts.

User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement