Tag: pyspark
How to count unique data occurring in multiple categorical columns from a pyspark dataframe
The Problem: In an ML-based scenario, I am trying to see the occurrence of data from multiple columns in an inference file versus the files that were provided to me for training. I need this to be found only for the categorical variables, since the numerical attributes are scaled. The Expectation: I’ve had some success doing the following in Standard
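A minimal sketch of one way to approach this, assuming two hypothetical DataFrames train_df and infer_df that share categorical columns color and region (none of these names are from the question): count each distinct value in the inference data, then use a left anti join to flag values never seen in training.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy stand-ins for the training and inference files (hypothetical data)
train_df = spark.createDataFrame([("red", "EU"), ("blue", "US")], ["color", "region"])
infer_df = spark.createDataFrame([("red", "EU"), ("green", "US")], ["color", "region"])

for c in ["color", "region"]:  # the assumed categorical columns
    # Occurrence count of each distinct value in the inference file
    infer_counts = infer_df.groupBy(c).count()
    # Values that never appear in training, found with a left anti join
    unseen = infer_counts.join(train_df.select(c).distinct(), on=c, how="left_anti")
    unseen.show()
```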
Daily forecast on a PySpark dataframe
I have the following dataframe in PySpark:
- DT_BORD_REF: date column for the month
- REF_DATE: a date reference for the current day, separating past and future
- PROD_ID: product ID
- COMPANY_CODE: company ID
- CUSTOMER_CODE: customer ID
- MTD_WD: month-to-date count of working days (Date = DT_BORD_REF)
- QUANTITY: number of items sold
- QTE_MTD: number of items month to date for DT_BORD_REF < REF_DATE
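A month-to-date column like QTE_MTD is typically built with a window function. Below is a minimal sketch over assumed toy data (only the column names come from the question): a running sum of QUANTITY per product and customer within each calendar month.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("2023-01-02", "P1", "C1", 5), ("2023-01-03", "P1", "C1", 7)],
    ["DT_BORD_REF", "PROD_ID", "CUSTOMER_CODE", "QUANTITY"],
).withColumn("DT_BORD_REF", F.to_date("DT_BORD_REF"))

# Running month-to-date sum per product/customer, ordered by date;
# F.trunc(..., "month") groups all days of the same calendar month
w = (Window.partitionBy("PROD_ID", "CUSTOMER_CODE", F.trunc("DT_BORD_REF", "month"))
           .orderBy("DT_BORD_REF")
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))
df.withColumn("QTE_MTD", F.sum("QUANTITY").over(w)).show()
```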
How can I select a column where another column has a specific value
I have a pyspark data frame. How can I select a column where another column has a specific value? Suppose I have n columns; for 2 columns A and B I have:

A B
a b
a c
d f

I want all of column B. …
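Assuming the goal is the values of B on rows where A matches some value (here "a", a guess, since the wanted value is not stated), a minimal sketch:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", "b"), ("a", "c"), ("d", "f")], ["A", "B"])

# Keep the rows where column A holds the wanted value, then project column B
df.filter(F.col("A") == "a").select("B").show()
```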
Converting query from SQL to pyspark
I am trying to convert the following SQL query into pyspark: The code I have in PySpark right now is this: However, this simply returns the number of rows in the “data” dataframe, and I know that isn’t correct. I am very new to PySpark; can anyone help me solve this? Answer You need to collect the result into
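The query and code were elided above, but the answer's hint — collect the result — usually looks like the sketch below: run the SQL and pull the scalar out of the first Row rather than calling .count() on the DataFrame. The table name "data" is from the question; the column "amount" is hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = spark.createDataFrame([(1, 10.0), (2, 5.0)], ["id", "amount"])
data.createOrReplaceTempView("data")

# data.count() would give the number of rows (2); to get the aggregate
# itself, collect the query result and index into the first Row
total = spark.sql("SELECT SUM(amount) AS total FROM data").collect()[0]["total"]
print(total)  # 15.0
```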
How to deal with an ambiguous column reference in a SQL query?
I have some code: I then try but I get an error: Error in SQL statement: AnalysisException: Reference ‘A.CDE_WR’ is ambiguous, could be: A.CDE_WR, A.CDE_WR.; line 6 pos 4 in Databricks. How can I deal with this? Answer This query: is using SELECT *. The * is shorthand for all columns from both tables. Obviously the combined columns from the
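A minimal sketch of the usual fix: qualify and alias the duplicated columns instead of relying on SELECT *. Only the column name CDE_WR comes from the error message; the tables and data are made up.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame([(1, "x")], ["ID", "CDE_WR"]).createOrReplaceTempView("A")
spark.createDataFrame([(1, "y")], ["ID", "CDE_WR"]).createOrReplaceTempView("B")

# SELECT * would emit two CDE_WR columns, making later references to that
# name ambiguous; qualifying and aliasing each one avoids the collision
spark.sql("""
    SELECT A.ID, A.CDE_WR AS cde_wr_a, B.CDE_WR AS cde_wr_b
    FROM A JOIN B ON A.ID = B.ID
""").show()
```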
How do I specify a default value when the value is “null” in a spark dataframe?
I have a data frame like the picture below. Where the value of the “item_param” column is “null”, I want to replace it with the string ‘test’. How can I do it? df = sv_df….
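A minimal sketch, assuming item_param is a string column (the picture and the rest of the code were not included). fillna handles genuine nulls; if the column instead holds the literal string "null", replace does the job.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", None), ("b", "v1")], ["item_id", "item_param"])

# Replace real nulls in item_param with the default string 'test'
df = df.fillna({"item_param": "test"})

# If the values are the literal string "null" rather than true nulls:
# df = df.replace("null", "test", subset=["item_param"])
df.show()
```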
How to add a ranking to a pyspark dataframe
I have a pyspark dataframe with 2 columns – id and count. I want to add a ranking to this by reverse count, so the highest count has rank 1, the second highest rank 2, etc. testDF = spark.createDataFrame([("DJS232", 437232)], ["id", "count"]) I first tried using and this worked, ish. It had monotonically increasing id numbers but the jump from the first
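A window-function ranking avoids the gaps that monotonically increasing ids produce. A minimal sketch, keeping the one row from the question and adding two hypothetical ones so the ordering is visible:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
testDF = spark.createDataFrame(
    [("DJS232", 437232), ("ABC111", 999999), ("XYZ999", 1234)],
    ["id", "count"],
)

# rank() over a window ordered by descending count: highest count -> rank 1.
# An unpartitioned window pulls all rows to one task; fine for small data.
w = Window.orderBy(F.desc("count"))
testDF.withColumn("rank", F.rank().over(w)).show()
```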
How to join two Hive tables with an embedded array of struct and array in PySpark
I am trying to join two Hive tables on Databricks. tab1: The schema of “some_questions”: “some_questions” example: tab2: I need to join tab1 and tab2 by “question_id” such that I get a new table. I tried to join them with pyspark, but I am not sure how to decompose the array with the embedded struct/array. Thanks. Answer For SparkSQL, you can
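The schemas were elided above, but the usual pattern for this kind of join is to explode the array of structs so each element becomes a row, after which question_id is an ordinary join key. A minimal sketch under an assumed schema (only the field question_id and the column some_questions come from the question):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
tab1 = spark.createDataFrame(
    [(1, [(10, "q-ten"), (11, "q-eleven")])],
    "user_id long, some_questions array<struct<question_id:long,text:string>>",
)
tab2 = spark.createDataFrame(
    [(10, "answer-a"), (11, "answer-b")], ["question_id", "answer"]
)

# Explode the array so each struct becomes its own row, flatten its
# fields, then join on the now-ordinary question_id column
flat = (tab1
        .withColumn("q", F.explode("some_questions"))
        .select("user_id", "q.question_id", "q.text"))
flat.join(tab2, "question_id").show()
```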
Sum of column returning all null values in PySpark SQL
I am new to Spark and this might be a straightforward problem. I have a SQL result named sql_left which is in the format: Here is a sample of the data generated using sql_left.take(1): Note: the Age column has ‘XXX’, ‘NUll’ and other integer values such as 023, 034, etc. printSchema shows Age and Total Cas as integers. I’ve tried the code below to first join the two tables:
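A common cause of an all-null sum is non-numeric tokens sitting in a string column. A minimal sketch of one fix, with made-up data mimicking the values quoted above: cast to int so ‘XXX’ and ‘NUll’ become nulls, which sum() then skips.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("023",), ("034",), ("XXX",), ("NUll",)], ["Age"])

# cast("int") turns non-numeric strings into nulls; sum ignores nulls,
# so the result is 23 + 34 = 57 instead of null
df.select(F.sum(F.col("Age").cast("int")).alias("total_age")).show()
```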