I want to write a function that takes a dataframe, a column of that dataframe, and a query template as arguments, and returns the result of the query when run on that column.
Something like:

```python
func_sql(df_tbl, 'age', 'select count(distinct {col}) from df_tbl')
```
Here, {col} should get replaced with 'age', and the output should be the result of the query run on 'age', i.e. the count of distinct values in 'age'.
Answer
The slightly confusing part is whether you want to parameterize the table as well as the query, but assuming you want to pass the query, the dataframe (table) name and the column as parameters and just want the first result returned by the function:
```python
def func_sql(df_tbl, col, query):
    # Substitute the {col} and {df_tbl} placeholders in the query template
    q_str = query.format(col=col, df_tbl=df_tbl)
    print(q_str)                            # print the resolved query
    q = spark.sql(q_str).collect()[0][0]    # run it and take the first value of the first row
    return q                                # return the result
```
The above function can be called as shown below to get the count of distinct values of age from df_tbl:
```python
print(func_sql('df_tbl', 'age', 'select count(distinct {col}) from {df_tbl}'))
```
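Note that spark.sql only sees tables registered in the Spark session, so the dataframe has to be exposed as a temporary view under the same name that is passed for the {df_tbl} placeholder. A minimal end-to-end sketch, assuming a local SparkSession and a small hypothetical dataframe with an age column, might look like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("func_sql_demo").getOrCreate()

# Hypothetical sample data; the view name must match the name passed for {df_tbl}
df = spark.createDataFrame([(25,), (30,), (25,), (40,)], ["age"])
df.createOrReplaceTempView("df_tbl")

print(func_sql('df_tbl', 'age', 'select count(distinct {col}) from {df_tbl}'))
# prints the resolved query, then 3 (distinct ages: 25, 30, 40)
```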