I have the following dataframe in PySpark:

DT_BORD_REF: date column for the month
REF_DATE: a date reference for the current day, separating past and future
PROD_ID: product ID
COMPANY_CODE: company ID
CUSTOMER_CODE: customer ID
MTD_WD: month-to-date count of working days (Date = DT_BORD_REF)
QUANTITY: number of items sold
QTE_MTD: number of items month to date, for DT_BORD_REF < REF_DATE
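The question body is cut off here, but given these columns, a hedged sketch of the usual pattern fits: computing QTE_MTD as a month-to-date running sum of QUANTITY over dates before REF_DATE, using a window partitioned by product, company, customer, and calendar month. The dataframe name df is an assumption.

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Running month-to-date window: one partition per product/company/customer
    # and calendar month, ordered by day.
    w = (Window.partitionBy("PROD_ID", "COMPANY_CODE", "CUSTOMER_CODE",
                            F.trunc("DT_BORD_REF", "month"))
               .orderBy("DT_BORD_REF")
               .rowsBetween(Window.unboundedPreceding, Window.currentRow))

    # Sum QUANTITY only for days strictly before REF_DATE; other days
    # contribute NULL, which sum() ignores.
    df = df.withColumn(
        "QTE_MTD",
        F.sum(F.when(F.col("DT_BORD_REF") < F.col("REF_DATE"),
                     F.col("QUANTITY"))).over(w),
    )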
Get max dates for each customer
Let's say I have a customer table like so: I want to get one row per customer id with the max(start_date), and if the start dates tie, use the max(created_at) as the tiebreaker. The result should look like this: I'm having a hard time with window functions, as I thought a partition by id would work, but I have two dates to order by. Maybe …
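A minimal sketch with row_number, assuming columns customer_id, start_date, and created_at (names taken from the question): ordering the window by both dates makes created_at the tiebreaker.

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # One window per customer: latest start_date first, latest created_at
    # breaking ties.
    w = Window.partitionBy("customer_id").orderBy(
        F.col("start_date").desc(), F.col("created_at").desc()
    )

    latest = (df.withColumn("rn", F.row_number().over(w))
                .filter(F.col("rn") == 1)
                .drop("rn"))

row_number (rather than rank) guarantees exactly one row per customer even if both dates tie.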
SparkSQLContext dataframe Select query based on column array
This is my dataframe: I want to select all books where the author is Udo Haiber, but of course it didn't work because authors is an array. Answer You can use array_contains to check whether the author is inside the array. Use single quotes around the author's name, because you're using double quotes for the query string.
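A short sketch of that quoting pattern, assuming the dataframe is registered as a table named books:

    from pyspark.sql import functions as F

    # SQL form: single quotes for the literal inside the double-quoted string.
    result = spark.sql(
        "SELECT * FROM books WHERE array_contains(authors, 'Udo Haiber')"
    )

    # Equivalent dataframe form, with no quoting concerns at all.
    result = df.filter(F.array_contains(F.col("authors"), "Udo Haiber"))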
How can I select a column based on a specific value in another column
I have a PySpark data frame. How can I select the values of one column on the rows where another column has a specific value? Suppose I have n columns; for two columns A and B the data looks like:

    A  B
    a  b
    a  c
    d  f

I want all of column B …
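The excerpt is garbled, but reading it as "give me column B on the rows where column A has a given value", a minimal sketch (the value "a" is an assumption) would be:

    from pyspark.sql import functions as F

    # Keep only rows where A matches, then project column B.
    b_values = df.filter(F.col("A") == "a").select("B")
    b_values.show()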
Converting query from SQL to pyspark
I am trying to convert the following SQL query into pyspark: The code I have in PySpark right now is this: However, this simply returns the number of rows in the "data" dataframe, and I know this isn't correct. I am very new to PySpark; can anyone help me solve this? Answer You need to collect the result into …
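The answer is cut off mid-sentence, but the usual point is that spark.sql returns a dataframe, not a value; to use an aggregate as a number you collect it into a Python variable. The query below is hypothetical, since the original SQL is not shown in the excerpt:

    # collect() materializes the result; [0] takes the single row,
    # ["total"] the aliased column.
    row = spark.sql("SELECT SUM(quantity) AS total FROM data").collect()[0]
    total = row["total"]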
How can I compare rows of data in an array based on distinct attributes of a column?
I have a tricky student assignment in Spark. I need to write an SQL query for this kind of array: There are more departments, and accordingly loans for each department, both for males and females. How can I compute a new array where female loans are more than male loans per department, and print/show only the departments where female loans exceed male loans?
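A minimal sketch under assumed column names department, gender, and loan_amount: pivot the summed loans by gender, then keep only the departments where the female total exceeds the male total.

    from pyspark.sql import functions as F

    # One row per department, with summed loans split into Female/Male columns.
    per_dept = (df.groupBy("department")
                  .pivot("gender", ["Female", "Male"])
                  .agg(F.sum("loan_amount")))

    per_dept.filter(F.col("Female") > F.col("Male")).show()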
How to merge rows using SQL only?
I can use neither pyspark nor scala; I can only write SQL code. I have a table with two columns, item_id and name, and I want to generate results where the names for each item_id are concatenated. How do I create such a table with Spark SQL? Answer The beauty of Spark SQL is that once you have a solution in any …
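The answer is cut off, but an SQL-only solution does exist in Spark: collect_list gathers all names per item_id and concat_ws joins them into one string. The table name items is an assumption; the spark.sql wrapper is only there to run it, and the SQL text itself is the whole answer.

    # Concatenate all names per item_id into a single comma-separated string.
    merged = spark.sql("""
        SELECT item_id,
               concat_ws(', ', collect_list(name)) AS names
        FROM items
        GROUP BY item_id
    """)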
How do I specify a default value when the value is “null” in a spark dataframe?
I have a data frame like the picture below. Where the value of the "item_param" column is "null", I want to replace it with the string 'test'. How can I do it? df = sv_df….
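Since the picture is not included in the excerpt, here is a hedged sketch that covers both a real NULL and the literal string "null" in item_param:

    from pyspark.sql import functions as F

    # Replace NULLs and the string "null" with "test"; leave other values alone.
    df = df.withColumn(
        "item_param",
        F.when(F.col("item_param").isNull() | (F.col("item_param") == "null"),
               "test")
         .otherwise(F.col("item_param")),
    )

If only real NULLs matter, df.fillna("test", subset=["item_param"]) is enough.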
SQL Spark – Lag vs first row by Group
I'm an SQL newbie and I'm trying to calculate the difference between averages. For each item and year I want to calculate the difference between months, but I always want to subtract the current month's average minus the first month's average …
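A minimal sketch under assumed names item, year, month, and avg_qty: first() over an ordered window pins the first month's value for the whole group, which is exactly what lag() cannot do.

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Ordered window per item and year; the default frame starts at the
    # first row, so first() always sees the first month.
    w = Window.partitionBy("item", "year").orderBy("month")

    df = df.withColumn(
        "diff_vs_first_month",
        F.col("avg_qty") - F.first("avg_qty").over(w),
    )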
Sum of column returning all null values in PySpark SQL
I am new to Spark and this might be a straightforward problem. I have a SQL result named sql_left which is in the format: Here is a sample of the data generated using sql_left.take(1): Note: the Age column has 'XXX', 'NUll', and other integer values such as 023, 034, etc. printSchema shows Age and Total Cas as integers. I've tried the below code to first join the two tables: …
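The code is cut off, but given values like 'XXX' and 'NUll' in Age, a likely cause is the cast: casting a non-numeric string to int yields NULL, and a column of failed casts sums to NULL. A hedged sketch that keeps only digit-only values before summing:

    from pyspark.sql import functions as F

    # Cast Age only where it is purely digits; everything else becomes NULL
    # and is ignored by sum().
    cleaned = df.withColumn(
        "Age_int",
        F.when(F.col("Age").rlike("^[0-9]+$"), F.col("Age").cast("int"))
    )
    cleaned.agg(F.sum("Age_int").alias("age_sum")).show()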