I have the following dataframe in PySpark:
DT_BORD_REF: date column for the month
REF_DATE: a date reference for the current day, separating past and future
PROD_ID: product ID
COMPANY_CODE: company ID
CUSTOMER_CODE: customer ID
MTD_WD: month-to-date count of working days (Date = DT_BORD_REF)
QUANTITY: number of items sold
QTE_MTD: number of items month to date, for DT_BORD_REF < REF_DATE
Tag: apache-spark
SparkSQLContext dataframe Select query based on column array
This is my dataframe: I want to select all books where the author is Udo Haiber, but of course that didn't work because authors is an array. Answer: You can use array_contains to check whether the author is inside the array. Use single quotes to quote the author name, because you're using double quotes for the query string.
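A minimal sketch of that approach, assuming a books view with a title column and an authors array column (the column names and data are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import array_contains

spark = SparkSession.builder.getOrCreate()

# hypothetical data: a title plus an array of authors
df = spark.createDataFrame(
    [("Book A", ["Udo Haiber", "Jane Doe"]), ("Book B", ["John Smith"])],
    ["title", "authors"],
)
df.createOrReplaceTempView("books")

# single quotes around the author name, double quotes around the query string
spark.sql("SELECT * FROM books WHERE array_contains(authors, 'Udo Haiber')").show()

# the equivalent DataFrame API call
df.filter(array_contains("authors", "Udo Haiber")).show()
```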
How can I select a column where another column contains a specific value
I have a PySpark data frame. How can I select a column where another column contains a specific value? Suppose I have n columns; for two columns A and B I have:
A B
a b
a c
d f
I want all of column B. …
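The question is cut off, but a sketch of what it seems to be after, assuming the goal is to get the B values for rows where column A holds a specific value:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# hypothetical two-column frame from the question
df = spark.createDataFrame([("a", "b"), ("a", "c"), ("d", "f")], ["A", "B"])

# keep only the B values for rows where A equals the value we need
df.filter(col("A") == "a").select("B").show()
```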
Converting query from SQL to pyspark
I am trying to convert the following SQL query into pyspark: The code I have in PySpark right now is this: However, this is simply returning the number of rows in the "data" dataframe, and I know this isn't correct. I am very new to PySpark, can anyone help me solve this? Answer: You need to collect the result into
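The answer is truncated, but the usual fix for this kind of mistake, assuming the PySpark code was calling count() while the SQL query computes an aggregate (the table and column names here are hypothetical), looks roughly like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical stand-in for the "data" dataframe from the question
data = spark.createDataFrame([(1, 10), (2, 20)], ["id", "quantity"])
data.createOrReplaceTempView("data")

# data.count() only returns the number of rows; run the SQL and
# collect the computed value instead
result = spark.sql("SELECT SUM(quantity) AS total FROM data")
total = result.collect()[0]["total"]
print(total)  # 30
```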
How can I compare rows of data in an array based on distinct attributes of a column?
I have a tricky student assignment in Spark. I need to write an SQL query for this kind of array: there are several departments and, accordingly, loans for each department, for both males and females. How can I compute a new array where the female loans are greater than the male loans per department, and print/show only the departments where female loans
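The question is truncated, but a hedged sketch of one way to express it, assuming a table with department, gender, and loan amount columns (all names and data hypothetical), keeping departments where the female total exceeds the male total:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical loans data: department, gender, loan amount
spark.createDataFrame(
    [("IT", "Female", 500), ("IT", "Male", 300),
     ("HR", "Female", 200), ("HR", "Male", 400)],
    ["department", "gender", "loan"],
).createOrReplaceTempView("loans")

spark.sql("""
    SELECT department
    FROM loans
    GROUP BY department
    HAVING SUM(CASE WHEN gender = 'Female' THEN loan ELSE 0 END)
         > SUM(CASE WHEN gender = 'Male'   THEN loan ELSE 0 END)
""").show()
```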
Finding largest number of location IDs per hour from each zone
I am using Scala with Spark and having a hard time understanding how to calculate the maximum count of pickups from a location corresponding to each hour. Currently I have a df with three columns (Location, hour, Zone), where Location is an integer, hour is an integer 0-23 signifying the hour of the day, and Zone is a string. Something like this
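The question asks for Scala, but the grouping logic is the same in PySpark; a sketch assuming the goal is, for each Zone and hour, the Location with the largest pickup count (the data is hypothetical, the column names come from the question):

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical pickups: (Location, hour, Zone)
df = spark.createDataFrame(
    [(1, 9, "A"), (1, 9, "A"), (2, 9, "A"), (3, 10, "B")],
    ["Location", "hour", "Zone"],
)

# count pickups per zone/hour/location, then keep the location with the
# highest count within each zone and hour
counts = df.groupBy("Zone", "hour", "Location").count()
w = Window.partitionBy("Zone", "hour").orderBy(F.desc("count"))
counts.withColumn("rn", F.row_number().over(w)).filter("rn = 1").drop("rn").show()
```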
How to merge rows using SQL only?
I can use neither pyspark nor scala; I can only write SQL code. I have a table with two columns, item_id and name. I want to generate results with the names for each item_id concatenated. How do I create such a table with Spark SQL? Answer: The beauty of Spark SQL is that once you have a solution in any
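The answer is cut off, but the standard SQL-only route in Spark is collect_list plus concat_ws; a sketch assuming columns item_id and name (the query itself is pure Spark SQL, only wrapped in spark.sql here to keep the example runnable):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical item/name pairs
spark.createDataFrame(
    [(1, "pen"), (1, "pencil"), (2, "book")],
    ["item_id", "name"],
).createOrReplaceTempView("items")

# concatenate all names per item_id using only SQL built-ins
spark.sql("""
    SELECT item_id,
           concat_ws(', ', collect_list(name)) AS names
    FROM items
    GROUP BY item_id
""").show()
```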
How do I specify a default value when the value is “null” in a spark dataframe?
I have a data frame like the picture below. Where the value in the "item_param" column is "null", I want to replace it with the string 'test'. How can I do it? df = sv_df….
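A minimal sketch of one way to do it, assuming item_param is a string column with real nulls (the actual dataframe comes from sv_df in the question; the data here is hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

spark = SparkSession.builder.getOrCreate()

# hypothetical stand-in for the dataframe in the question
df = spark.createDataFrame([(1, None), (2, "abc")], ["id", "item_param"])

# replace nulls in item_param with the string 'test'
df = df.fillna({"item_param": "test"})

# the same thing spelled out with when/otherwise
df = df.withColumn(
    "item_param",
    when(col("item_param").isNull(), "test").otherwise(col("item_param")),
)
df.show()
```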
Pivot rows in a different way using MySQL or Spark DataFrame
I have a table like this, and I am doing normal pivoting, which is not giving the desired result. I want to get it in this way: I tried doing it like this: But it's not giving the expected result. Can anyone suggest what can be done for the desired result? Either a plain SQL or a DataFrame solution would be helpful. Answer
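The source table isn't shown, so it's hard to say exactly which shape is wanted, but the usual DataFrame-side starting point is groupBy().pivot().agg(); a generic sketch with hypothetical columns:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical long-format data: one row per (id, key, value)
df = spark.createDataFrame(
    [(1, "colA", 10), (1, "colB", 20), (2, "colA", 30)],
    ["id", "key", "value"],
)

# turn the distinct key values into columns, one row per id
df.groupBy("id").pivot("key").agg(F.first("value")).show()
```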
Doing a sum of columns based on some complex logic in PySpark
Here is the question in the image attached. Table: the result column is calculated based on the rules below. If col3 > 0, then result = col1 + col2. If col3 = 0, then result = sum(col2) until col3 > 0, plus col1 from the row where col3 > 0. For example, for row 3 the result = 60 + 70 + 80 + 30 (col1 from row 5, because col3 > 0 there) = 240; for row 4, the result = 70 + 80 + 30 (col1 from row 5
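The question is truncated, but a hedged sketch of a window-function approach under those rules, assuming a row_id column that defines the row order (the data is made up to match the example numbers):

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical data: rows 3-5 form one block ending in a col3 > 0 row
df = spark.createDataFrame(
    [(1, 10, 40, 5), (2, 20, 50, 3), (3, 25, 60, 0), (4, 35, 70, 0), (5, 30, 80, 2)],
    ["row_id", "col1", "col2", "col3"],
)

# scanning from the bottom up, every col3 = 0 row gets the same group id as
# the next col3 > 0 row below it
flag = (F.col("col3") > 0).cast("int")
df = df.withColumn("grp", F.sum(flag).over(Window.orderBy(F.desc("row_id"))))

# within a group, look from the current row down to that col3 > 0 row
w = (Window.partitionBy("grp").orderBy("row_id")
     .rowsBetween(Window.currentRow, Window.unboundedFollowing))

df = df.withColumn(
    "result",
    F.when(F.col("col3") > 0, F.col("col1") + F.col("col2"))
     .otherwise(F.sum("col2").over(w) + F.last("col1").over(w)),
)
df.orderBy("row_id").show()  # row 3 -> 240, row 4 -> 180, as in the example
```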