I have two dataframes in Spark, both with an IP column. One has over 800,000 entries while the other has 4,000. What I want to do is check whether the IPs in the smaller dataframe appear in the IP column of the larger dataframe.
At the moment all I can manage is a row-by-row comparison: first row against first row, second row against second row, and so on.
Thanks in advance!
Answer
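How about something like this, straight from the manuals, using EXISTS or NOT EXISTS after registering both dataframes as temp views? See here: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2728434780191932/1483312212640900/6987336228780374/latest.html
%sql SELECT * FROM t1 A WHERE NOT EXISTS (SELECT 1 FROM t2 B WHERE B.colx = A.colx)
As written, this returns the rows of t1 whose colx has no match in t2. Swap NOT EXISTS for EXISTS to get the rows that do have a match, which is what you are asking for if t1 is the smaller dataframe.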