Is there a way to compare all rows in one column of a dataframe against all rows in another column of another dataframe (spark)?

Question

I have two dataframes in Spark, both with an IP column. One column has over 800000 entries while the other has 4000 entries. What I want to do is to see if the IP&#8217;s in the smaller dataframe appear in the IP column of the large dataframe. At the moment all I can manage is to compare the first row

Accepted Answer

How about something like this, straight from the manuals using EXISTS or NOT EXISTS, having created tempviews, see here https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2728434780191932/1483312212640900/6987336228780374/latest.html:%sqlSELECT  *FROM    t1 AWHERE   NOT EXISTS (SELECT  1                      FROM  t2 B                     WHERE  B.colx = A.colx)

Advertisement

Answer