How can I write an SQL query as a template in PySpark?

I want to write a function that takes a dataframe, a column in that dataframe, and a query template as arguments, and outputs the result of running the query on that column.

Something like: func_sql(df_tbl, 'age', 'select count(distinct {col}) from df_tbl')

Here, {col} should get replaced with 'age', and the output should be the result of running the query on 'age', i.e. the count of distinct values in 'age'.


Answer

The slightly confusing part is whether you want to parameterize the table name as well. But assuming you want to pass the dataframe, the column, and the query as parameters, and just want the first value of the result as the function's return:
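A minimal sketch of such a function (the original answer's code block is not shown on this page, so the parameter order, the view_name argument, and the render_query helper name are assumptions). It registers the dataframe as a temp view under the name used in the template, substitutes the column into the template with str.format, and returns the first value of the first result row:

```python
def render_query(template, col):
    """Substitute the column name into the query template.

    '{col}' in the template is replaced via str.format, so
    render_query('select count(distinct {col}) from df_tbl', 'age')
    yields 'select count(distinct age) from df_tbl'.
    """
    return template.format(col=col)


def func_sql(spark, df, view_name, col, template):
    """Run a templated SQL query against df and return the first value.

    Assumes `spark` is an active pyspark.sql.SparkSession and that
    `view_name` matches the table name used inside the template.
    """
    # Expose the dataframe to Spark SQL under the templated table name.
    df.createOrReplaceTempView(view_name)
    query = render_query(template, col)
    # collect()[0][0]: first column of the first row, e.g. the count.
    return spark.sql(query).collect()[0][0]
```

With a live SparkSession, a call such as func_sql(spark, df_tbl, 'df_tbl', 'age', 'select count(distinct {col}) from df_tbl') would then return a single number: the distinct count of age.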

A function along these lines, called with df_tbl, 'age', and the template above, returns the count of distinct values of age in df_tbl.

User contributions licensed under: CC BY-SA
4 people found this helpful