
Cross Join, Compare Values, and Select Closest Match – More Efficient Way?

I have two tables, each with two columns. I cross join them and take the difference of the values. I then compute a row_number ordered by that difference and keep the rows where rw = 1. In effect, for each t1.id I’m finding the t2.id whose val is closest to t1.id’s val.

These tables are quite large. Is the row_number function doing a lot of extra unneeded work by ordering everything beyond the first row? I only need the lowest-ranked row. Is there a more efficient way to write this?

Table 1

id val
A1 0.123456
A2 1.123456
A3 -0.712345

Table 2

id val
B1 0.065432
B2 1.654321
B3 -0.654321

--find the t2.id with the closest value to t1.id's val
with cj as (
select 
  t1.id as A_id, t2.id as B_id,  
  row_number() over (partition by t1.id order by abs(t1.val - t2.val)) as rw
 from t1
 cross join t2
)
select * from cj where rw = 1


Answer

It is possible to run this faster – it depends on how many rows are in t1 and t2, and how much flexibility you have to add indexes, etc.

As @Chris says, sorting (especially sorting many times) can be a killer. The cost of sorting grows faster than linearly (roughly n log n) with the number of values you are sorting, so it gets increasingly worse the more you have. If t2 only had two rows, sorting would be easy and your original method would probably be the most efficient. However, if t2 has many rows, it becomes much harder. And if t1 also has many rows and you’re doing a sort for every one of them, that multiplies the cost again.

As such, for testing purposes, I have used 1,000 rows in each of t1 and t2 below.

Below I compare several approaches, with indicators of speed and processing. (Spoiler alert: if you can pre-sort the data, as in @Chris’ suggestion, you can get some big improvements.)

I don’t use Databricks (sorry) and cannot measure speeds etc. on it. Therefore the below is written and tested in SQL Server – but I’d guess it can be modified to work in Databricks fairly easily. I think the main difference is the OUTER APPLY used here – in Databricks/Spark SQL the equivalent is a lateral join (e.g., How to use outer apply in Spark sql). Note that CROSS APPLY is equivalent to an inner (CROSS) JOIN LATERAL, while OUTER APPLY is equivalent to a LEFT JOIN LATERAL.
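
For reference, here is a hedged sketch of what that OUTER APPLY pattern (used in approach 3 below) might look like in lateral-join form – Postgres-style syntax, untested on Databricks, where the exact LATERAL syntax may differ:

-- Hedged sketch: lateral-join form of the OUTER APPLY pattern below
-- (Postgres-style syntax; untested on Databricks, syntax may vary)
SELECT t1.id, closest.id AS B_id
FROM t1
LEFT JOIN LATERAL (
    SELECT t2.id
    FROM t2
    ORDER BY abs(t1.val - t2.val)
    LIMIT 1
) AS closest ON true;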

I created the two tables and filled them with 1,000 rows each.

CREATE TABLE #t1 (A_id nvarchar(10) PRIMARY KEY, val decimal(10,8));
CREATE TABLE #t2 (B_id nvarchar(10) PRIMARY KEY, val decimal(10,8));
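
The fill code isn’t shown; here is a sketch of one way the 1,000 rows per table might be generated (the exact data-generation is my assumption):

-- Sketch: generate 1,000 random test rows per table (my assumption;
-- the answer doesn't show the fill code it used)
WITH n AS (
    SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS i
    FROM sys.all_objects a CROSS JOIN sys.all_objects b
)
INSERT INTO #t1 (A_id, val)
SELECT 'A' + CAST(i AS nvarchar(5)),
       CAST(RAND(CHECKSUM(NEWID())) * 2 - 1 AS decimal(10,8))
FROM n;

WITH n AS (
    SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS i
    FROM sys.all_objects a CROSS JOIN sys.all_objects b
)
INSERT INTO #t2 (B_id, val)
SELECT 'B' + CAST(i AS nvarchar(5)),
       CAST(RAND(CHECKSUM(NEWID())) * 2 - 1 AS decimal(10,8))
FROM n;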

Original approach – sort all rows

Your original query needs very few data reads, but the cost is the amount of sorting it has to do. Because ROW_NUMBER() sorts all the rows and you then take only one, this is your major cost (as @Chris says).

-- Original query
with cj as (
select 
  #t1.A_id, #t2.B_id,  
  row_number() over (partition by #t1.A_id order by abs(#t1.val - #t2.val)) as rw
 from #t1
 cross join #t2
)
select * from cj where rw = 1;

On my computer, this took 1600ms of CPU time.
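
The answer doesn’t say exactly how the timings were captured; in SQL Server, one standard way (my assumption) is to enable time and IO statistics before running each query:

-- One way to capture CPU time and reads per query
-- (the exact measurement method used here is an assumption)
SET STATISTICS TIME ON;
SET STATISTICS IO ON;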

Approach 2 – taking the MIN() value

However, as you only need the minimum row, there is really no need to sort the other rows. Taking a ‘min’ only requires one scan through the data for each data point in t1, picking the smallest value.

However, once you have the smallest value, you then need to refer to t2 again to get the relevant t2 IDs.

In other words, the logic of this is

  • spend less time by determining only the smallest absolute difference (instead of sorting them all)
  • spend more reads and more time finding which value(s) of t2 produce that absolute difference

-- Using MIN() to find smallest difference
with cj as (
select 
  #t1.A_id, #t1.val,
  MIN(abs(#t1.val - #t2.val)) AS minvaldif
 from #t1
 cross join #t2
 GROUP BY #t1.A_id, #t1.val
 )
select cj.A_ID,
    #t2.B_id
FROM    cj
        CROSS JOIN #t2
WHERE   abs(cj.val - #t2.val) = minvaldif;

This took my computer about half the time of the original – about 800ms of CPU time – but it more than doubles the amount of data reads. Note also that it can return several rows if (say) there are repeated values in t2.
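
If you need exactly one row per A_id, you could collapse those ties explicitly – a sketch, assuming the lowest B_id is an acceptable pick, replacing the final SELECT above:

-- Sketch: break ties by keeping the lowest B_id per A_id
-- (which duplicate to keep is an assumption)
select cj.A_id, MIN(#t2.B_id) AS B_id
FROM    cj
        CROSS JOIN #t2
WHERE   abs(cj.val - #t2.val) = minvaldif
GROUP BY cj.A_id;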

Approach 3 – lateral join

In this case, you do a lateral join (in SQL Server, an ‘OUTER APPLY’) to select just the one minimum value you need. Similar to the above, you find the min value, but you do it individually for each row in t1. Therefore you need to do 1,000 ‘min’ operations rather than 1,000 sorts.

-- Lateral join with min difference
SELECT  #t1.A_id, t2_calc.B_id
    FROM    #t1
            OUTER APPLY 
                (SELECT TOP (1) #t2.B_Id
                FROM #T2
                ORDER BY abs(#t1.val - #t2.val)
                ) AS t2_calc;

This is the most efficient so far – with few reads and only 300ms of CPU time. If you cannot add indexes, this is probably the best you can do.

Option 4 – pre-sort the data with an index

If you can pre-sort the data using an index, then you can boost your efficiency by a lot.

CREATE NONCLUSTERED INDEX #IX_t2_val ON #t2 (val);

The ‘gotcha’ is that even if you have an index on t2.val, databases will have a problem with min(abs(t1.val - t2.val)) – they will usually still need to read all the data rather than use the index.

However, you can use the logic you identified in your question – the min(abs(difference)) value is the one where t1.val is closest to t2.val.

In the method below

  1. For every t1.val, find the closest t2 row that is less than or equal to it
  2. Also find, for every t1.val, the closest t2 row that is above it
  3. Then, using the logic from your original query, pick whichever of those two is closer.

For example, for A1 (val 0.123456) the closest t2.val at or below it is B1 (0.065432) and the closest above it is B2 (1.654321); comparing just those two differences gives B1.

This also uses lateral joins (CROSS APPLY).

-- Using indexes
with cj as 
        (SELECT #t1.A_id, #t1.val AS A_val, t2_lessthan.B_id, t2_lessthan.val AS B_val
            FROM    #t1
                    CROSS APPLY 
                        (SELECT TOP (1) #t2.B_Id, #t2.val
                        FROM #T2
                        WHERE #t2.val <= #t1.val
                        ORDER BY #t2.val DESC
                        ) AS t2_lessthan

        UNION ALL
        SELECT  #t1.A_id, #t1.val AS A_val, t2_greaterthan.B_id, t2_greaterthan.val AS B_val
            FROM    #t1
                    CROSS APPLY 
                        (SELECT TOP (1) #t2.B_Id, #t2.val
                        FROM #T2
                        WHERE #t2.val > #t1.val
                        ORDER BY #t2.val
                        ) AS t2_greaterthan     
        ),
    cj_rn AS
        (SELECT A_id, B_id,
                row_number() over (partition by A_id order by abs(A_val - B_val)) as rw
            FROM cj
        )
    select * from cj_rn where rw = 1;

CPU time: 4ms.

For each value in t1, it simply does two index seeks in t2 and ‘sorts’ the two resulting values (which is trivial). As such, in this case, it is orders of magnitude faster.

So… really the best approach is, if you can, to pre-sort the data (in this case by adding an index) and then make sure you take advantage of that sort.
