I have a DataFrame called product_relationship_current and I’m doing a self-join to retrieve a new DataFrame, like below: First I give it an alias so I can treat it as two different DataFrames: And then I do the self-join to get a new DataFrame: But I’m looking for another way to do that without a self-join, so I don’t
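A minimal sketch of the alias-then-self-join pattern described above, assuming a hypothetical product_id/parent_id schema for product_relationship_current:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for product_relationship_current
product_relationship_current = spark.createDataFrame(
    [(1, None), (2, 1), (3, 1)],
    ["product_id", "parent_id"],
)

# Alias the same DataFrame twice so it can be treated as two different ones
a = product_relationship_current.alias("a")
b = product_relationship_current.alias("b")

# Self-join: pair each row with its parent row
joined = a.join(b, F.col("a.parent_id") == F.col("b.product_id"))
joined.show()
```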
Tag: apache-spark
How to aggregate on multiple columns using SQL or spark SQL
I have the following table: The expected output is: The aggregation computation involves 2 columns; is this supported in SQL? Answer: In Spark SQL you can do it like this: or in one select: The higher-order aggregate function is used in this example. aggregate(expr, start, merge, finish) – applies a binary operator to an initial state and all elements in the array, and
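A hedged sketch of what such a query can look like with the higher-order aggregate function; the sales table, grouping column, and price/qty columns are made up for illustration (the computation combines two columns per group):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.createDataFrame(
    [("a", 10.0, 2), ("a", 5.0, 4), ("b", 3.0, 1)],
    ["grp", "price", "qty"],
).createOrReplaceTempView("sales")

spark.sql("""
    SELECT grp,
           aggregate(
               collect_list(struct(price, qty)),   -- array built from the 2 columns
               CAST(0.0 AS DOUBLE),                -- start (initial state)
               (acc, x) -> acc + x.price * x.qty,  -- merge step
               acc -> acc                          -- finish step
           ) AS total
    FROM sales
    GROUP BY grp
""").show()
```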
filter stop words from text column – spark SQL
I’m using Spark SQL and have a DataFrame with user IDs and product reviews. I need to filter stop words out of the reviews, and I have a text file with the stop words to filter. I managed to split the reviews into lists of strings, but I don’t know how to do the filtering. This is what I tried to do: Thanks!
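One possible way to do the filtering (not necessarily what the asker ended up with): read the stop-word file into a Python list and apply StopWordsRemover from pyspark.ml to the already-split review column. The column names here are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import StopWordsRemover

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, ["this", "product", "is", "great"])],
    ["user_id", "review_words"],
)

# The stop words would normally come from the text file, e.g.:
# stop_words = [line.strip() for line in open("stopwords.txt")]
stop_words = ["this", "is"]

remover = StopWordsRemover(
    inputCol="review_words", outputCol="review_filtered", stopWords=stop_words
)
remover.transform(df).show(truncate=False)
```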
sql – how to join on a column that is less than another join key
I have two tables as below. What I’m trying to do is join A and B based on date and id, to get the value from B. The problem is, I want to join using add_months(A.Date, -1) = B.month (find the data in table B from one month earlier). If that’s not available, I want to join using two
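A sketch of the primary join condition with add_months; the schemas of A and B are guessed from the description. The fallback the question goes on to describe could be handled with a second left join on the alternative condition and a coalesce over the two results:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

a = spark.createDataFrame([(1, "2023-05-01")], ["id", "Date"]) \
         .withColumn("Date", F.to_date("Date"))
b = spark.createDataFrame([(1, "2023-04-01", 42.0)], ["id", "month", "value"]) \
         .withColumn("month", F.to_date("month"))

# Join on id and on "one month earlier than A.Date"
joined = a.join(
    b,
    (a["id"] == b["id"]) & (F.add_months(a["Date"], -1) == b["month"]),
    "left",
).select(a["id"], a["Date"], b["value"])
joined.show()
```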
PySpark: Adding elements from python list into spark.sql() statement
I have a list in Python that is used throughout my code: I also have a simple spark.sql() line that I need to execute: I want to replace the list of elements in the spark.sql() statement with the Python list, so that the last line in the SQL is I am aware of using {} and str.format but I am struggling
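A minimal sketch of one way to do the substitution with str.format, building the quoted IN (...) list from the Python list; the table and column names are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.createDataFrame([("a", 1), ("b", 2), ("d", 4)], ["code", "value"]) \
     .createOrReplaceTempView("my_table")

my_list = ["a", "b", "c"]

# Quote each element and join them: "'a', 'b', 'c'"
in_clause = ", ".join("'{}'".format(x) for x in my_list)

df = spark.sql(
    "SELECT * FROM my_table WHERE code IN ({})".format(in_clause)
)
df.show()
```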
Filter a Dataframe using a subset of it and two specific fields in spark/scala [closed]
Closed. This question needs debugging details and is not currently accepting answers. Closed 10 months ago. I have a Scala/Spark question. I’m using Spark 2.1.1. I have a Dataframe
Using a SQL query in Spark SQL – error in execution
When I try to execute this query in PySpark I get an error every time. I have looked everywhere but I can’t figure out why it doesn’t work; it would be great if someone could help me. The goal of this query is to update a new column, which I will create later, called temp_ok. This is my code: My table contains these columns: _temp_ok_calculer, Operator level
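Since the query itself is not shown, this is only a generic sketch of deriving a temp_ok column with when/otherwise rather than an UPDATE (which plain Spark DataFrames and temp views don’t support); the condition and the values of _temp_ok_calculer are assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "OK"), (2, "KO")],
    ["id", "_temp_ok_calculer"],
)

# Derive the new column rather than updating the table in place
df = df.withColumn(
    "temp_ok", F.when(F.col("_temp_ok_calculer") == "OK", 1).otherwise(0)
)
df.show()
```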
Group by range of dates from date_start to date_end columns
I have a table with the following structure: I want to count how many events (each row is an event) happened at each place in each month. If an event’s dates span several months, it should be counted for all affected months. place_id can be repeated, so I did the following query: So I get the following grouped table: The problem is
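One way to count an event in every month it touches (a sketch, Spark 2.4+): expand each row into the months between date_start and date_end with sequence + explode, then group. The table structure follows the description; the data is made up:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.createDataFrame(
    [(1, "2023-01-20", "2023-03-05"), (2, "2023-02-01", "2023-02-10")],
    ["place_id", "date_start", "date_end"],
).select(
    "place_id",
    F.to_date("date_start").alias("date_start"),
    F.to_date("date_end").alias("date_end"),
)

per_month = (
    events
    # One row per month between the start month and the end month of the event
    .withColumn(
        "month",
        F.explode(
            F.sequence(
                F.trunc("date_start", "month"),
                F.trunc("date_end", "month"),
                F.expr("interval 1 month"),
            )
        ),
    )
    .groupBy("place_id", "month")
    .count()
)
per_month.show()
```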
Select rows from a table which contain at least one alphabetic character in the column
I have a column called name in a table in Databricks. I want to find a way to select only those rows from the table which contain at least one alphabetic character in the name column. Example values in the column: Expected: I need to pick only those values which contain at least one letter. Or in other words, I
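A sketch of the kind of filter that does this, using rlike with a letter character class (the sample values are invented); the SQL equivalent would be WHERE name RLIKE '[A-Za-z]':

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("1234",), ("12ab",), ("hello",), ("!!",)],
    ["name"],
)

# Keep only rows whose name contains at least one letter
df.filter(F.col("name").rlike("[A-Za-z]")).show()
```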
Spark.sql Filter rows by MAX
Below is part of a source file which you could imagine being much bigger: After the following code: I would like to obtain this result: The aim is to: select the dates on which each cityname has the MAX total (note: a city can appear twice if it has the MAX total for 2 different dates), sort by total descending, then by date
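One way this can be done (a sketch over invented data matching the description): compute the per-city maximum with a window, keep the rows that reach it (so ties keep a city twice), then sort:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        ("London", "2020-01-01", 10),
        ("London", "2020-01-02", 25),
        ("Paris",  "2020-01-01", 25),
        ("Paris",  "2020-01-02", 25),
    ],
    ["cityname", "date", "total"],
)

w = Window.partitionBy("cityname")
result = (
    df.withColumn("max_total", F.max("total").over(w))
      .filter(F.col("total") == F.col("max_total"))  # keeps ties
      .drop("max_total")
      .orderBy(F.col("total").desc(), "date")
)
result.show()
```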