Tag: pyspark

How can I write an SQL query as a template in PySpark?

I want to write a function that takes a column, a dataframe containing that column, and a query template as arguments, and outputs the result of the query when run on the column. Something like: func_sql(df_tbl, 'age', 'select count(distinct {col}) from df_tbl'). Here, {col} should get replaced with 'age' and the output should be the result of the query run on 'age', i.e.
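A minimal sketch of such a helper, assuming the template refers to a temp view named df_tbl and uses Python str.format-style placeholders (both assumptions, not stated in the original question):

```python
from pyspark.sql import SparkSession, DataFrame

def func_sql(df_tbl: DataFrame, col: str, query_template: str) -> DataFrame:
    """Run a templated SQL query against df_tbl, substituting {col}."""
    spark = SparkSession.builder.getOrCreate()
    # Register the dataframe under the name the template references (assumed to be "df_tbl").
    df_tbl.createOrReplaceTempView("df_tbl")
    # Substitute the column name into the template, e.g. {col} -> age.
    query = query_template.format(col=col)
    return spark.sql(query)

# Usage (hypothetical dataframe df with an 'age' column):
# func_sql(df, "age", "select count(distinct {col}) from df_tbl").show()
```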

Spark.sql Filter rows by MAX

Below is part of a source file, which you could imagine being much bigger. After the following code, I would like to obtain this result. The aim is to: select the dates on which each cityname has the MAX total (note: a city can appear twice if it has the MAX total for two different dates), sort by total descending, then by date.
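One common way to do this is a window over cityname that keeps every row matching the per-city maximum, so ties on different dates are all retained. The column names cityname, date, and total and the sample rows below are assumptions for illustration:

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data; the column names are assumptions.
df = spark.createDataFrame(
    [("Paris", "2021-01-01", 10), ("Paris", "2021-01-02", 10),
     ("Lyon", "2021-01-01", 7), ("Lyon", "2021-01-02", 3)],
    ["cityname", "date", "total"],
)

# Keep every row whose total equals the maximum total for its city,
# then apply the requested sort.
w = Window.partitionBy("cityname")
result = (
    df.withColumn("max_total", F.max("total").over(w))
      .filter(F.col("total") == F.col("max_total"))
      .drop("max_total")
      .orderBy(F.col("total").desc(), "date")
)
result.show()
```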

Spark: How to transpose and explode columns with dynamic nested arrays

I applied the algorithm from the question Spark: How to transpose and explode columns with nested arrays to transpose and explode a nested Spark dataframe with dynamic arrays. I have added the row """{"id":3,"c":[{"date":3,"val":3,"val_dynamic":3}]}""" to the dataframe, with a new column c whose array has a new val_dynamic field that can appear on a random basis. I'm looking for required output 2 (Transpose and
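A rough sketch of the dynamic part, assuming the rows are JSON strings like the one above and that every struct field except date should be turned into (name, value) pairs; the field list is read from the inferred schema rather than hard-coded:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input mirroring the row from the question; val_dynamic
# only exists in some records.
rows = [
    '{"id":1,"c":[{"date":1,"val":5}]}',
    '{"id":3,"c":[{"date":3,"val":3,"val_dynamic":3}]}',
]
df = spark.read.json(spark.sparkContext.parallelize(rows))

# Explode the array of structs, then read the struct's field names from the
# inferred schema so new fields such as val_dynamic are picked up dynamically.
exploded = df.select("id", F.explode("c").alias("c"))
fields = [f for f in exploded.schema["c"].dataType.fieldNames() if f != "date"]

# Turn each remaining struct field into (name, value) pairs via a map + explode.
result = exploded.select(
    "id",
    F.col("c.date").alias("date"),
    F.explode(
        F.create_map(*[x for f in fields for x in (F.lit(f), F.col(f"c.{f}"))])
    ).alias("name", "value"),
)
result.show()
```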

Is there a method to connect to PostgreSQL (DBeaver) from PySpark?

Hello, I installed PySpark and I have a local Postgres database in DBeaver: how can I connect to Postgres from PySpark? I tried this but I got an error. Answer: You need to add the jars you want to use when creating the SparkSession. See this: https://spark.apache.org/docs/2.4.7/submitting-applications.html#advanced-dependency-management Either when you start pyspark or when you
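A minimal sketch of the jar-plus-JDBC setup the answer describes; the driver path, connection URL, table name, and credentials below are placeholders:

```python
from pyspark.sql import SparkSession

# Point Spark at the PostgreSQL JDBC driver jar when building the session
# (spark.jars.packages with Maven coordinates works as an alternative).
spark = (
    SparkSession.builder
    .appName("postgres-example")
    .config("spark.jars", "/path/to/postgresql-42.2.24.jar")
    .getOrCreate()
)

# Read a table over JDBC; all connection details here are placeholders.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/mydb")
    .option("dbtable", "public.my_table")
    .option("user", "postgres")
    .option("password", "secret")
    .option("driver", "org.postgresql.Driver")
    .load()
)
df.show()
```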