Skip to content
Advertisement

Is there a way to compare all rows in one column of a dataframe against all rows in another column of another dataframe (spark)?

I have two dataframes in Spark, both with an IP column. One column has over 800000 entries while the other has 4000 entries. What I want to do is to see if the IP’s in the smaller dataframe appear in the IP column of the large dataframe.

At the moment all I can manage is to compare the first row of the columns, second rows of the columns etc.

Thanks in advance!

Advertisement

Answer

How about something like this, straight from the manuals using EXISTS or NOT EXISTS, having created tempviews, see here https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2728434780191932/1483312212640900/6987336228780374/latest.html:

%sql
SELECT  *
FROM    t1 A
WHERE   NOT EXISTS (SELECT  1
                      FROM  t2 B
                     WHERE  B.colx = A.colx)
Advertisement