mismatched input error when trying to use Spark subquery

I'm new to PySpark and trying to get a query to run. It seems like it should work, but I get an EOF parse error and I'm not sure how to resolve it.

What I'm trying to do is find all rows in blah.table where the value in the column "domainname" matches a value from a list of domains. Then I want to take the two ID columns from those rows and do a second search for all rows in blah.table that contain those pairs of IDs.

So far I have:

df = spark.sql("select * from blah.table where id1,id2 in (select id1,id2 from blah.table where domainname in ('list.com','of.com','domains.com'))")

When I run it I get this error:

mismatched input ',' expecting {<EOF>, ';'}

If I split the query up, this seems to run fine by itself:

df = spark.sql("select id1,id2 from blah.table where domainname in ('list.com','of.com','domains.com')")

How can I make this work?

Answer

The parser chokes on the comma because a multi-column comparison on the left-hand side of IN must be written as a parenthesized tuple:

select * from blah.table
where (id1, id2) in (select id1,id2 from blah.table
                     where domainname in ('list.com','of.com','domains.com'))

Or, use EXISTS:

select * from blah.table t1
where EXISTS (select * from blah.table t2
              where t1.id1 = t2.id1 and t1.id2 = t2.id2
                and t2.domainname in ('list.com','of.com','domains.com'))