Tag: apache-spark-sql

Get max dates for each customer

Let’s say I have a customer table like so: I want to get one row per customer id with the max(start_date), and if the dates are tied, the max(created_at). The result should look like this: I’m having a hard time with window functions, as I thought a PARTITION BY id would work, but I have …
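A standard way to solve this is a ROW_NUMBER window partitioned by customer, ordered by start_date and then created_at descending, keeping rank 1. The sketch below is a minimal, hypothetical reconstruction (the table and sample data are made up, since the original post's table isn't shown); SQLite stands in for Spark SQL so it runs anywhere, and the query itself runs unchanged in Spark SQL.

```python
import sqlite3

# Hypothetical customer table; the columns (id, start_date, created_at)
# are assumed from the question, not copied from the original post.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (id INT, start_date TEXT, created_at TEXT)")
con.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [
        (1, "2021-01-01", "2021-01-01 09:00"),
        (1, "2021-02-01", "2021-02-01 09:00"),  # later start_date wins
        (2, "2021-03-01", "2021-03-01 08:00"),
        (2, "2021-03-01", "2021-03-01 10:00"),  # tied start_date: later created_at wins
    ],
)

# Rank each customer's rows by start_date, breaking ties with created_at,
# and keep only the top-ranked row per customer.
query = """
SELECT id, start_date, created_at
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY id
               ORDER BY start_date DESC, created_at DESC
           ) AS rn
    FROM customer
) AS t
WHERE rn = 1
"""
rows = con.execute(query).fetchall()
print(rows)
```

The key detail is that the ORDER BY inside the window clause carries the tiebreak: a plain PARTITION BY alone doesn't pick a row, the ranking does.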

Converting query from SQL to pyspark

I am trying to convert the following SQL query into PySpark: The code I have in PySpark right now is this: However, this simply returns the number of rows in the “data” dataframe, which I know isn’t correct. I am very new to PySpark; can anyone help me solve this? Answer You need to …
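The original query and answer are cut off above, so the following is only a hedged illustration of the symptom described: a bare count collapses everything to the total row count, whereas grouping first gives a count per key. The table and column names are invented, and SQLite stands in for Spark so the sketch is self-contained; the PySpark equivalents are noted in comments.

```python
import sqlite3

# Hypothetical "data" table; names are made up for illustration only.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (category TEXT, value INT)")
con.executemany("INSERT INTO data VALUES (?, ?)",
                [("a", 1), ("a", 2), ("b", 3)])

# A bare COUNT(*) returns just the number of rows -- the symptom in the
# question (in PySpark, df.count() behaves the same way).
total = con.execute("SELECT COUNT(*) FROM data").fetchone()[0]
print(total)  # 3

# Grouping first gives one count per key instead.  In PySpark this would be
# df.groupBy("category").count(), or the SQL run unchanged via:
#   df.createOrReplaceTempView("data")
#   spark.sql("SELECT category, COUNT(*) FROM data GROUP BY category")
per_group = con.execute(
    "SELECT category, COUNT(*) FROM data GROUP BY category"
).fetchall()
print(sorted(per_group))
```

Registering the DataFrame as a temporary view and running the original SQL through `spark.sql()` is often the simplest port: the query needs no translation at all.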

How to merge rows using SQL only?

I can use neither pyspark nor scala; I can only write SQL code. I have a table with two columns, item_id and name. I want to generate results with the names for each item_id concatenated. How do I create such a table with Spark SQL? Answer The beauty of Spark SQL is that once you have a solution in any
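In Spark SQL the usual pure-SQL answer is to group by the key and aggregate the names with `collect_list`, then join them with `concat_ws`. The sketch below uses invented sample data, with SQLite's GROUP_CONCAT standing in for the Spark form so it runs without a Spark cluster; the Spark SQL version is shown in the comment.

```python
import sqlite3

# Hypothetical items table with the two columns from the question.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (item_id INT, name TEXT)")
con.executemany("INSERT INTO items VALUES (?, ?)",
                [(1, "apple"), (1, "pear"), (2, "plum")])

# SQLite's GROUP_CONCAT stands in for the Spark SQL aggregation:
#   SELECT item_id, concat_ws(',', collect_list(name)) AS names
#   FROM items
#   GROUP BY item_id
rows = con.execute(
    "SELECT item_id, GROUP_CONCAT(name, ',') FROM items GROUP BY item_id"
).fetchall()
print(rows)
```

Note that neither GROUP_CONCAT nor collect_list guarantees the order of names within a group unless you sort explicitly (e.g. via `sort_array` in Spark).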