
Tag: apache-spark-sql

Daily forecast on a PySpark dataframe

I have the following dataframe in PySpark:

- DT_BORD_REF: date column for the month
- REF_DATE: date reference for the current day, separating past from future
- PROD_ID: product ID
- COMPANY_CODE: company ID
- CUSTOMER_CODE: customer ID
- MTD_WD: month-to-date count of working days (Date = DT_BORD_REF)
- QUANTITY: number of items sold
- QTE_MTD: number of items month to date, for DT_BORD_REF < REF_DATE
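The rest of the question is not shown in this excerpt, so the following is only a hedged sketch of how a month-to-date column such as QTE_MTD can be derived with a window function. The sample values and the choice of grouping keys (PROD_ID, COMPANY_CODE, CUSTOMER_CODE and the calendar month) are assumptions based on the column descriptions above:

```python
from datetime import date
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Tiny hypothetical sample with the columns described above.
df = spark.createDataFrame(
    [
        (date(2021, 2, 1), date(2021, 2, 3), "P1", "C1", "CU1", 1, 10),
        (date(2021, 2, 2), date(2021, 2, 3), "P1", "C1", "CU1", 2, 15),
        (date(2021, 2, 3), date(2021, 2, 3), "P1", "C1", "CU1", 3, 20),
        (date(2021, 2, 4), date(2021, 2, 3), "P1", "C1", "CU1", 4, 30),
    ],
    ["DT_BORD_REF", "REF_DATE", "PROD_ID", "COMPANY_CODE",
     "CUSTOMER_CODE", "MTD_WD", "QUANTITY"],
)

# Running month-to-date sum of QUANTITY per product/company/customer,
# counting only days strictly before REF_DATE (one way to derive QTE_MTD).
w = (
    Window.partitionBy("PROD_ID", "COMPANY_CODE", "CUSTOMER_CODE",
                       F.trunc("DT_BORD_REF", "month"))
    .orderBy("DT_BORD_REF")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)

df.withColumn(
    "QTE_MTD",
    F.sum(
        F.when(F.col("DT_BORD_REF") < F.col("REF_DATE"), F.col("QUANTITY"))
    ).over(w),
).show()
```

Restricting the sum with `F.when` keeps quantities from days on or after REF_DATE out of the running total, so future rows carry the month-to-date value as of the reference date.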

Get max dates for each customer

Let’s say I have a customer table like so: I want to get one row per customer id with the max(start_date); if two rows share that date, use the max(created_at) to break the tie. The result should look like this: I’m having a hard time with window functions, as I thought a partition by id would work, but I have two dates. Maybe …
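One common way to express this with a window function, sketched here under the assumption that the table is a dataframe named `customers` with columns `id`, `start_date` and `created_at` (names inferred from the excerpt, since the original table is not shown):

```python
from datetime import date, datetime
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical customer table; column names are inferred from the excerpt.
customers = spark.createDataFrame(
    [
        (1, date(2021, 1, 10), datetime(2021, 1, 10, 8, 0)),
        (1, date(2021, 1, 10), datetime(2021, 1, 10, 9, 30)),
        (2, date(2021, 3, 5), datetime(2021, 3, 5, 12, 0)),
        (2, date(2021, 2, 1), datetime(2021, 2, 1, 12, 0)),
    ],
    ["id", "start_date", "created_at"],
)

# Rank rows per customer by start_date, breaking ties with created_at,
# then keep only the top-ranked row for each id.
w = Window.partitionBy("id").orderBy(
    F.col("start_date").desc(), F.col("created_at").desc()
)

(
    customers.withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn")
    .show()
)
```

Using `row_number` rather than `rank` guarantees exactly one row per id even if both dates tie.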

Converting query from SQL to pyspark

I am trying to convert the following SQL query into pyspark: The code I have in PySpark right now is this: However, this simply returns the number of rows in the “data” dataframe, and I know that isn’t correct. I am very new to PySpark; can anyone help me solve this? Answer You need to collect the result into …
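Neither the original SQL nor the asker’s PySpark code is shown in this excerpt, so the snippet below is only an illustrative sketch of the usual pitfall: `.count()` returns the number of matching rows, whereas the value of an aggregate has to be computed with `agg` and then collected from the single-row result. The dataframe name `data` and the columns `status` and `amount` are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data; the real query and columns are not shown in the excerpt.
data = spark.createDataFrame(
    [("OK", 10.0), ("OK", 5.0), ("KO", 99.0)],
    ["status", "amount"],
)

# .count() only gives the number of matching rows.
n_rows = data.filter(F.col("status") == "OK").count()

# To get the value the SQL aggregate would return, compute the aggregate
# and collect it from the resulting single-row dataframe.
total = (
    data.filter(F.col("status") == "OK")
    .agg(F.sum("amount").alias("total"))
    .first()["total"]
)
print(n_rows, total)  # 2 15.0
```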

How to merge rows using SQL only?

I can use neither pyspark nor scala; I can only write SQL code. I have a table with two columns, item_id and name. I want to generate a result in which the names belonging to each item_id are concatenated. How do I create such a table with Spark SQL? Answer The beauty of Spark SQL is that once you have a solution in any …
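A minimal sketch of that pattern in plain Spark SQL, assuming the data is available as a table or view named `items` (the table name and sample rows are assumptions; `collect_list` and `concat_ws` are the built-in functions doing the work). It is wrapped in `spark.sql` here only to make the snippet runnable:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical item table registered as a temporary view so it can be
# queried with plain SQL.
spark.createDataFrame(
    [(1, "pen"), (1, "notebook"), (2, "stapler")],
    ["item_id", "name"],
).createOrReplaceTempView("items")

# collect_list gathers every name for an item_id and concat_ws joins them
# into a single comma-separated string, all in Spark SQL.
spark.sql("""
    SELECT item_id,
           concat_ws(',', collect_list(name)) AS names
    FROM items
    GROUP BY item_id
""").show(truncate=False)
```

Note that `collect_list` does not guarantee element order; wrapping it in `sort_array(collect_list(name))` gives a deterministic ordering if that matters.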
