Skip to content
Advertisement

PySpark: Adding elements from python list into spark.sql() statement

have list in python that is used throughout my code:

pylist = ['A', 'B', 'C', 'D']

I also have have a simple spark.sql() line that I need to execute:

query = spark.sql(
    """
    SELECT col1, col2, col3
    FROM database.table
    WHERE col3 IN ('A', 'B', 'C', 'D')
    """
)

I want to replace the list of elements in the spark.sql() statment with the python list so that that last line in the SQL is

...
AND col3 IN pylist

I am aware of using {} and str.format but I am struggling to understand if that’s the correct option here, and how that workd.

Answer

Okay, after some more research, I think I found the solution using .format(tuple(pylist))

The statement now is:

pylist = ['A', 'B', 'C', 'D']

s = """
    SELECT col1, col2, col3
    FROM database.table
    WHERE col3 IN {}
    """.format(tuple(pylist))

query = spark.sql(s)