I have the following dataframe in PySpark:
DT_BORD_REF: date column for the month
REF_DATE: a date reference for the current day, separating past and future
PROD_ID: product ID
COMPANY_CODE: company ID
CUSTOMER_CODE: customer ID
MTD_WD: month-to-date count of working days (Date = DT_BORD_REF)
QUANTITY: number of items sold
QTE_MTD: number of items month to date, for DT_BORD_REF < REF_DATE
Tag: apache-spark
SparkSQLContext dataframe Select query based on column array
This is my dataframe: I want to select all books where the author is Udo Haiber, but of course that didn't work because authors is an array. Answer: You can use array_contains to check whether the author is inside the array. Use single quotes to quote the author name, because you're using double quotes for the query string.
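A minimal sketch of that approach, assuming a books view with a title column and an authors array column (the column names and data are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import array_contains

spark = SparkSession.builder.getOrCreate()

# hypothetical data: a title plus an array of authors
df = spark.createDataFrame(
    [("Book A", ["Udo Haiber", "Jane Doe"]), ("Book B", ["John Smith"])],
    ["title", "authors"],
)
df.createOrReplaceTempView("books")

# single quotes around the author name, double quotes around the query string
spark.sql("SELECT * FROM books WHERE array_contains(authors, 'Udo Haiber')").show()

# the equivalent DataFrame API call
df.filter(array_contains("authors", "Udo Haiber")).show()
```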
How can I select a column where another column contains a specific value
I have a PySpark data frame. How can I select a column where another column contains a specific value? Suppose I have n columns; for two columns A and B I have:
A B
a b
a c
d f
I want all of column B. …
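The question is cut off, but a sketch of what it seems to be after, assuming the goal is to get the B values for rows where column A holds a specific value:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# hypothetical two-column frame from the question
df = spark.createDataFrame([("a", "b"), ("a", "c"), ("d", "f")], ["A", "B"])

# keep only the B values for rows where A equals the value we need
df.filter(col("A") == "a").select("B").show()
```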
Converting query from SQL to pyspark
I am trying to convert the following SQL query into pyspark: The code I have in PySpark right now is this: However, this is simply returning the number of rows in the "data" dataframe, and I know this isn't correct. I am very new to PySpark, can anyone help me solve this? Answer: You need to collect the result into
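The answer is truncated, but the usual fix for this kind of mistake, assuming the PySpark code was calling count() while the SQL query computes an aggregate (the table and column names here are hypothetical), looks roughly like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical stand-in for the "data" dataframe from the question
data = spark.createDataFrame([(1, 10), (2, 20)], ["id", "quantity"])
data.createOrReplaceTempView("data")

# data.count() only returns the number of rows; run the SQL and
# collect the computed value instead
result = spark.sql("SELECT SUM(quantity) AS total FROM data")
total = result.collect()[0]["total"]
print(total)  # 30
```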
How can I compare rows of data in an array based on distinct attributes of a column?
I have a tricky student assignment in Spark. I need to write an SQL query for this kind of array: there are several departments and, accordingly, loans for each department, for both males and females. How can I compute a new array where the female loans are greater than the male loans per department, and print/show only the departments where female loans
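The question is truncated, but a hedged sketch of one way to express it, assuming a table with department, gender, and loan amount columns (all names and data hypothetical), keeping departments where the female total exceeds the male total:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical loans data: department, gender, loan amount
spark.createDataFrame(
    [("IT", "Female", 500), ("IT", "Male", 300),
     ("HR", "Female", 200), ("HR", "Male", 400)],
    ["department", "gender", "loan"],
).createOrReplaceTempView("loans")

spark.sql("""
    SELECT department
    FROM loans
    GROUP BY department
    HAVING SUM(CASE WHEN gender = 'Female' THEN loan ELSE 0 END)
         > SUM(CASE WHEN gender = 'Male'   THEN loan ELSE 0 END)
""").show()
```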
Finding largest number of location IDs per hour from each zone
I am using Scala with Spark and having a hard time understanding how to calculate the maximum count of pickups from a location corresponding to each hour. Currently I have a df with three columns (Location, hour, Zone), where Location is an integer, hour is an integer 0-23 signifying the hour of the day, and Zone is a string. Something like this
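The question asks for Scala, but the grouping logic is the same in PySpark; a sketch assuming the goal is, for each Zone and hour, the Location with the largest pickup count (the data is hypothetical, the column names come from the question):

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical pickups: (Location, hour, Zone)
df = spark.createDataFrame(
    [(1, 9, "A"), (1, 9, "A"), (2, 9, "A"), (3, 10, "B")],
    ["Location", "hour", "Zone"],
)

# count pickups per zone/hour/location, then keep the location with the
# highest count within each zone and hour
counts = df.groupBy("Zone", "hour", "Location").count()
w = Window.partitionBy("Zone", "hour").orderBy(F.desc("count"))
counts.withColumn("rn", F.row_number().over(w)).filter("rn = 1").drop("rn").show()
```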
How to merge rows using SQL only?
I can use neither pyspark nor scala; I can only write SQL code. I have a table with two columns, item_id and name. I want to generate results with the names for each item_id concatenated. How do I create such a table with Spark SQL? Answer: The beauty of Spark SQL is that once you have a solution in any
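The answer is cut off, but the standard SQL-only route in Spark is collect_list plus concat_ws; a sketch assuming columns item_id and name (the query itself is pure Spark SQL, only wrapped in spark.sql here to keep the example runnable):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical item/name pairs
spark.createDataFrame(
    [(1, "pen"), (1, "pencil"), (2, "book")],
    ["item_id", "name"],
).createOrReplaceTempView("items")

# concatenate all names per item_id using only SQL built-ins
spark.sql("""
    SELECT item_id,
           concat_ws(', ', collect_list(name)) AS names
    FROM items
    GROUP BY item_id
""").show()
```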
How do I specify a default value when the value is “null” in a spark dataframe?
I have a data frame like the picture below. Where the value in the "item_param" column is "null", I want to replace it with the string 'test'. How can I do it? df = sv_df….
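A minimal sketch of one way to do it, assuming item_param is a string column with real nulls (the actual dataframe comes from sv_df in the question; the data here is hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

spark = SparkSession.builder.getOrCreate()

# hypothetical stand-in for the dataframe in the question
df = spark.createDataFrame([(1, None), (2, "abc")], ["id", "item_param"])

# replace nulls in item_param with the string 'test'
df = df.fillna({"item_param": "test"})

# the same thing spelled out with when/otherwise
df = df.withColumn(
    "item_param",
    when(col("item_param").isNull(), "test").otherwise(col("item_param")),
)
df.show()
```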
Pivot rows in a different way using MySQL or Spark DataFrame
I have a table like this, and I am doing normal pivoting, which is not giving the desired result. I want to get it in this way: I tried doing it like this: But it's not giving the expected result. Can anyone suggest what can be done for the desired result? Either a plain SQL or a DataFrame solution would be helpful. Answer
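The source table isn't shown, so it's hard to say exactly which shape is wanted, but the usual DataFrame-side starting point is groupBy().pivot().agg(); a generic sketch with hypothetical columns:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical long-format data: one row per (id, key, value)
df = spark.createDataFrame(
    [(1, "colA", 10), (1, "colB", 20), (2, "colA", 30)],
    ["id", "key", "value"],
)

# turn the distinct key values into columns, one row per id
df.groupBy("id").pivot("key").agg(F.first("value")).show()
```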
Doing a sum of columns based on some complex logic in PySpark
Here is the question in the image attached. Table: the result column is calculated based on the rules below. If col3 > 0, then result = col1 + col2. If col3 = 0, then result = sum(col2) until col3 > 0, plus col1 from the row where col3 > 0. For example, for row 3 the result = 60 + 70 + 80 + 30 (col1 from row 5, because col3 > 0 there) = 240; for row 4, the result = 70 + 80 + 30 (col1 from row 5
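The question is truncated, but a hedged sketch of a window-function approach under those rules, assuming a row_id column that defines the row order (the data is made up to match the example numbers):

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical data: rows 3-5 form one block ending in a col3 > 0 row
df = spark.createDataFrame(
    [(1, 10, 40, 5), (2, 20, 50, 3), (3, 25, 60, 0), (4, 35, 70, 0), (5, 30, 80, 2)],
    ["row_id", "col1", "col2", "col3"],
)

# scanning from the bottom up, every col3 = 0 row gets the same group id as
# the next col3 > 0 row below it
flag = (F.col("col3") > 0).cast("int")
df = df.withColumn("grp", F.sum(flag).over(Window.orderBy(F.desc("row_id"))))

# within a group, look from the current row down to that col3 > 0 row
w = (Window.partitionBy("grp").orderBy("row_id")
     .rowsBetween(Window.currentRow, Window.unboundedFollowing))

df = df.withColumn(
    "result",
    F.when(F.col("col3") > 0, F.col("col1") + F.col("col2"))
     .otherwise(F.sum("col2").over(w) + F.last("col1").over(w)),
)
df.orderBy("row_id").show()  # row 3 -> 240, row 4 -> 180, as in the example
```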