Skip to content
Advertisement

How can I write an SQL query as a template in PySpark?

I want to write a function that takes a column, a dataframe containing that column and a query template as arguments that outputs the result of the query when run on the column.

Something like: func_sql(df_tbl,’age’,’select count(distinct {col}) from df_tbl’)

Here, {col} should get replace with ‘age’ and output should be the result of the query run on ‘age’, i.e. count of distinct values in ‘age’.

Answer

A bit confusing part is weather you want to parameterize query or table or not but assuming you want to pass query dataframe and column as parameter and just want first result as return of function

def func_sql(df_tbl,col,query): #function definition
  print(eval(f"f'{query}'")) # printing evaluated query by replacing parameters
  q=spark.sql(eval(f"f'{query}'")).collect()[0][0] # getting first result
  return q # returning the result

The above function can be called in below way to get distinct values of age from df_tbl

print(func_sql('df_tbl','age','select count(distinct {col}) from {df_tbl}'))