I want to filter a PySpark DataFrame with a SQL-like `IN` clause, as in
```python
sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.sql('SELECT * from my_df WHERE field1 IN a')
```
where `a` is the tuple `(1, 2, 3)`. I am getting this error:
```
java.lang.RuntimeException: [1.67] failure: "(" expected but identifier a found
```
which is basically saying it was expecting something like `(1, 2, 3)` instead of `a`. The problem is that I can't write the values into `a` manually, as it's extracted from another job.
How would I filter in this case?
Answer
The string you pass to `SQLContext.sql` is evaluated in the scope of the SQL environment; it doesn't capture the Python closure. If you want to pass a variable, you'll have to build the query explicitly using string formatting:
```python
df = sc.parallelize([(1, "foo"), (2, "x"), (3, "bar")]).toDF(("k", "v"))
df.registerTempTable("df")

sqlContext.sql("SELECT * FROM df WHERE v IN {0}".format(("foo", "bar"))).count()
## 2
```
Obviously this is not something you would use in a “real” SQL environment because of security considerations (it is open to SQL injection), but it shouldn’t matter here.
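One caveat: Python renders a one-element tuple with a trailing comma, so `str(("foo",))` gives `('foo',)`, which is not valid SQL. If the tuple can have any length, a small helper is safer; here is a minimal sketch for string values (the name `format_in_clause` is made up for illustration):

```python
def format_in_clause(values):
    # Quote each value and join with commas, so ("foo",) becomes
    # ('foo') instead of the invalid ('foo',)
    return "({0})".format(", ".join("'{0}'".format(v) for v in values))

a = ("foo", "bar")  # in practice this comes from the other job
sqlContext.sql("SELECT * FROM df WHERE v IN {0}".format(format_in_clause(a))).count()
## 2
```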
In practice the DataFrame DSL is a much better choice when you want to create dynamic queries:
```python
from pyspark.sql.functions import col

df.where(col("v").isin({"foo", "bar"})).count()
## 2
```
It is easy to build and compose, and it handles all the details of HiveQL / Spark SQL quoting for you.
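Tying this back to the question: `isin` accepts either unpacked arguments or a single list/set, so with a tuple it is safest to convert it first. A short sketch using the `df` from above (whose `k` column holds 1, 2, 3):

```python
from pyspark.sql.functions import col

a = (1, 2, 3)  # extracted from another job
df.where(col("k").isin(list(a))).count()
## 3
```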