I have the following dataframe in PySpark:

DT_BORD_REF: date column for the month
REF_DATE: a date reference for the current day, separating past and future
PROD_ID: product ID
COMPANY_CODE: company ID
CUSTOMER_CODE: customer ID
MTD_WD: month-to-date count of working days (Date = DT_BORD_REF)
QUANTITY: number of items sold
QTE_MTD: number of items month to date, for DT_BORD_REF < REF_DATE
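The question body is cut off here, but given these columns, a hedged sketch of the usual pattern fits: computing QTE_MTD as a month-to-date running sum of QUANTITY over dates before REF_DATE, using a window partitioned by product, company, customer, and calendar month. The dataframe name df is an assumption.

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Running month-to-date window: one partition per product/company/customer
    # and calendar month, ordered by day.
    w = (Window.partitionBy("PROD_ID", "COMPANY_CODE", "CUSTOMER_CODE",
                            F.trunc("DT_BORD_REF", "month"))
               .orderBy("DT_BORD_REF")
               .rowsBetween(Window.unboundedPreceding, Window.currentRow))

    # Sum QUANTITY only for days strictly before REF_DATE; other days
    # contribute NULL, which sum() ignores.
    df = df.withColumn(
        "QTE_MTD",
        F.sum(F.when(F.col("DT_BORD_REF") < F.col("REF_DATE"),
                     F.col("QUANTITY"))).over(w),
    )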
Get max dates for each customer
Let's say I have a customer table like so: I want to get one row per customer id with the max(start_date), and if the start dates tie, use the max(created_at) as the tiebreaker. The result should look like this: I'm having a hard time with window functions, as I thought a partition by id would work, but I have two dates to order by. Maybe …
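A minimal sketch with row_number, assuming columns customer_id, start_date, and created_at (names taken from the question): ordering the window by both dates makes created_at the tiebreaker.

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # One window per customer: latest start_date first, latest created_at
    # breaking ties.
    w = Window.partitionBy("customer_id").orderBy(
        F.col("start_date").desc(), F.col("created_at").desc()
    )

    latest = (df.withColumn("rn", F.row_number().over(w))
                .filter(F.col("rn") == 1)
                .drop("rn"))

row_number (rather than rank) guarantees exactly one row per customer even if both dates tie.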
SparkSQLContext dataframe Select query based on column array
This is my dataframe: I want to select all books where the author is Udo Haiber, but of course it didn't work because authors is an array. Answer You can use array_contains to check whether the author is inside the array. Use single quotes around the author's name, because you're using double quotes for the query string.
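A short sketch of that quoting pattern, assuming the dataframe is registered as a table named books:

    from pyspark.sql import functions as F

    # SQL form: single quotes for the literal inside the double-quoted string.
    result = spark.sql(
        "SELECT * FROM books WHERE array_contains(authors, 'Udo Haiber')"
    )

    # Equivalent dataframe form, with no quoting concerns at all.
    result = df.filter(F.array_contains(F.col("authors"), "Udo Haiber"))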
How can I select a column based on a specific value in another column
I have a PySpark data frame. How can I select the values of one column on the rows where another column has a specific value? Suppose I have n columns; for two columns A and B the data looks like:

    A  B
    a  b
    a  c
    d  f

I want all of column B …
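The excerpt is garbled, but reading it as "give me column B on the rows where column A has a given value", a minimal sketch (the value "a" is an assumption) would be:

    from pyspark.sql import functions as F

    # Keep only rows where A matches, then project column B.
    b_values = df.filter(F.col("A") == "a").select("B")
    b_values.show()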
Converting query from SQL to pyspark
I am trying to convert the following SQL query into pyspark: The code I have in PySpark right now is this: However, this simply returns the number of rows in the "data" dataframe, and I know this isn't correct. I am very new to PySpark; can anyone help me solve this? Answer You need to collect the result into …
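The answer is cut off mid-sentence, but the usual point is that spark.sql returns a dataframe, not a value; to use an aggregate as a number you collect it into a Python variable. The query below is hypothetical, since the original SQL is not shown in the excerpt:

    # collect() materializes the result; [0] takes the single row,
    # ["total"] the aliased column.
    row = spark.sql("SELECT SUM(quantity) AS total FROM data").collect()[0]
    total = row["total"]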
How can I compare rows of data in an array based on distinct attributes of a column?
I have a tricky student assignment in Spark. I need to write an SQL query for this kind of array: There are more departments, and accordingly loans for each department, both for males and females. How can I compute a new array where female loans are more than male loans per department, and print/show only the departments where female loans exceed male loans?
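A minimal sketch under assumed column names department, gender, and loan_amount: pivot the summed loans by gender, then keep only the departments where the female total exceeds the male total.

    from pyspark.sql import functions as F

    # One row per department, with summed loans split into Female/Male columns.
    per_dept = (df.groupBy("department")
                  .pivot("gender", ["Female", "Male"])
                  .agg(F.sum("loan_amount")))

    per_dept.filter(F.col("Female") > F.col("Male")).show()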
How to merge rows using SQL only?
I can use neither pyspark nor scala; I can only write SQL code. I have a table with two columns, item_id and name, and I want to generate results where the names for each item_id are concatenated. How do I create such a table with Spark SQL? Answer The beauty of Spark SQL is that once you have a solution in any …
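The answer is cut off, but an SQL-only solution does exist in Spark: collect_list gathers all names per item_id and concat_ws joins them into one string. The table name items is an assumption; the spark.sql wrapper is only there to run it, and the SQL text itself is the whole answer.

    # Concatenate all names per item_id into a single comma-separated string.
    merged = spark.sql("""
        SELECT item_id,
               concat_ws(', ', collect_list(name)) AS names
        FROM items
        GROUP BY item_id
    """)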
How do I specify a default value when the value is “null” in a spark dataframe?
I have a data frame like the picture below. Where the value of the "item_param" column is "null", I want to replace it with the string 'test'. How can I do it? df = sv_df….
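Since the picture is not included in the excerpt, here is a hedged sketch that covers both a real NULL and the literal string "null" in item_param:

    from pyspark.sql import functions as F

    # Replace NULLs and the string "null" with "test"; leave other values alone.
    df = df.withColumn(
        "item_param",
        F.when(F.col("item_param").isNull() | (F.col("item_param") == "null"),
               "test")
         .otherwise(F.col("item_param")),
    )

If only real NULLs matter, df.fillna("test", subset=["item_param"]) is enough.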
SQL Spark – Lag vs first row by Group
I'm an SQL newbie and I'm trying to calculate the difference between averages. For each item and year I want to calculate the difference between months, but I always want to subtract the current month's average minus the first month's average …
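A minimal sketch under assumed names item, year, month, and avg_qty: first() over an ordered window pins the first month's value for the whole group, which is exactly what lag() cannot do.

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Ordered window per item and year; the default frame starts at the
    # first row, so first() always sees the first month.
    w = Window.partitionBy("item", "year").orderBy("month")

    df = df.withColumn(
        "diff_vs_first_month",
        F.col("avg_qty") - F.first("avg_qty").over(w),
    )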
Sum of column returning all null values in PySpark SQL
I am new to Spark and this might be a straightforward problem. I have a SQL result named sql_left which is in the format: Here is a sample of the data generated using sql_left.take(1): Note: the Age column has 'XXX', 'NUll', and other integer values such as 023, 034, etc. printSchema shows Age and Total Cas as integers. I've tried the below code to first join the two tables: …
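The code is cut off, but given values like 'XXX' and 'NUll' in Age, a likely cause is the cast: casting a non-numeric string to int yields NULL, and a column of failed casts sums to NULL. A hedged sketch that keeps only digit-only values before summing:

    from pyspark.sql import functions as F

    # Cast Age only where it is purely digits; everything else becomes NULL
    # and is ignored by sum().
    cleaned = df.withColumn(
        "Age_int",
        F.when(F.col("Age").rlike("^[0-9]+$"), F.col("Age").cast("int"))
    )
    cleaned.agg(F.sum("Age_int").alias("age_sum")).show()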