I want to filter a PySpark DataFrame with a SQL-like `IN` clause, as in
```python
sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.sql('SELECT * from my_df WHERE field1 IN a')
```
where `a` is the tuple `(1, 2, 3)`. I am getting this error:
```
java.lang.RuntimeException: [1.67] failure: "(" expected but identifier a found
```
which is basically saying it was expecting something like `(1, 2, 3)` instead of `a`. The problem is that I can't write the values into `a` manually, as it's extracted from another job.
How would I filter in this case?
Answer
The string you pass to `SQLContext.sql` is evaluated in the scope of the SQL environment; it doesn't capture the Python closure. If you want to pass a variable, you'll have to build the query explicitly using string formatting:
```python
df = sc.parallelize([(1, "foo"), (2, "x"), (3, "bar")]).toDF(("k", "v"))
df.registerTempTable("df")

sqlContext.sql("SELECT * FROM df WHERE v IN {0}".format(("foo", "bar"))).count()
## 2
```
Obviously this is not something you would use in a “real” SQL environment because of security considerations (it is open to SQL injection), but it shouldn’t matter here.
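One caveat: Python renders a one-element tuple with a trailing comma, so `str(("foo",))` gives `('foo',)`, which is not valid SQL. If the tuple can have any length, a small helper is safer; here is a minimal sketch for string values (the name `format_in_clause` is made up for illustration):

```python
def format_in_clause(values):
    # Quote each value and join with commas, so ("foo",) becomes
    # ('foo') instead of the invalid ('foo',)
    return "({0})".format(", ".join("'{0}'".format(v) for v in values))

a = ("foo", "bar")  # in practice this comes from the other job
sqlContext.sql("SELECT * FROM df WHERE v IN {0}".format(format_in_clause(a))).count()
## 2
```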
In practice the DataFrame DSL is a much better choice when you want to create dynamic queries:
```python
from pyspark.sql.functions import col

df.where(col("v").isin({"foo", "bar"})).count()
## 2
```
It is easy to build and compose, and it handles all the details of HiveQL / Spark SQL quoting for you.
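Tying this back to the question: `isin` accepts either unpacked arguments or a single list/set, so with a tuple it is safest to convert it first. A short sketch using the `df` from above (whose `k` column holds 1, 2, 3):

```python
from pyspark.sql.functions import col

a = (1, 2, 3)  # extracted from another job
df.where(col("k").isin(list(a))).count()
## 3
```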