I have a DataFrame called product_relationship_current and I’m doing a self-join to retrieve a new DataFrame, like below: First I give it an alias so I can treat it as two different DataFrames: And then I do the self-join to get a new DataFrame: But I’m looking for another way to do that without a self-join, so I don’t
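A minimal sketch of the alias-then-self-join pattern described above, assuming a hypothetical product_id/parent_id schema for product_relationship_current:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for product_relationship_current
product_relationship_current = spark.createDataFrame(
    [(1, None), (2, 1), (3, 1)],
    ["product_id", "parent_id"],
)

# Alias the same DataFrame twice so it can be treated as two different ones
a = product_relationship_current.alias("a")
b = product_relationship_current.alias("b")

# Self-join: pair each row with its parent row
joined = a.join(b, F.col("a.parent_id") == F.col("b.product_id"))
joined.show()
```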
Tag: apache-spark
How to aggregate on multiple columns using SQL or spark SQL
I have the following table: The expected output is: The aggregation computation involves 2 columns; is this supported in SQL? Answer: In Spark SQL you can do it like this: or in one select: The higher-order aggregate function is used in this example. aggregate(expr, start, merge, finish) – applies a binary operator to an initial state and all elements in the array, and
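A hedged sketch of what such a query can look like with the higher-order aggregate function; the sales table, grouping column, and price/qty columns are made up for illustration (the computation combines two columns per group):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.createDataFrame(
    [("a", 10.0, 2), ("a", 5.0, 4), ("b", 3.0, 1)],
    ["grp", "price", "qty"],
).createOrReplaceTempView("sales")

spark.sql("""
    SELECT grp,
           aggregate(
               collect_list(struct(price, qty)),   -- array built from the 2 columns
               CAST(0.0 AS DOUBLE),                -- start (initial state)
               (acc, x) -> acc + x.price * x.qty,  -- merge step
               acc -> acc                          -- finish step
           ) AS total
    FROM sales
    GROUP BY grp
""").show()
```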
filter stop words from text column – spark SQL
I’m using Spark SQL and have a DataFrame with user IDs and product reviews. I need to filter stop words out of the reviews, and I have a text file with the stop words to filter. I managed to split the reviews into lists of strings, but I don’t know how to do the filtering. This is what I tried to do: Thanks!
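One possible way to do the filtering (not necessarily what the asker ended up with): read the stop-word file into a Python list and apply StopWordsRemover from pyspark.ml to the already-split review column. The column names here are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import StopWordsRemover

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, ["this", "product", "is", "great"])],
    ["user_id", "review_words"],
)

# The stop words would normally come from the text file, e.g.:
# stop_words = [line.strip() for line in open("stopwords.txt")]
stop_words = ["this", "is"]

remover = StopWordsRemover(
    inputCol="review_words", outputCol="review_filtered", stopWords=stop_words
)
remover.transform(df).show(truncate=False)
```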
sql – how to join on a column that is less than another join key
I have two tables as below. What I’m trying to do is join A and B based on date and id, to get the value from B. The problem is, I want to join using add_months(A.Date, -1) = B.month (find the data in table B from one month earlier). If that’s not available, I want to join using two
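A sketch of the primary join condition with add_months; the schemas of A and B are guessed from the description. The fallback the question goes on to describe could be handled with a second left join on the alternative condition and a coalesce over the two results:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

a = spark.createDataFrame([(1, "2023-05-01")], ["id", "Date"]) \
         .withColumn("Date", F.to_date("Date"))
b = spark.createDataFrame([(1, "2023-04-01", 42.0)], ["id", "month", "value"]) \
         .withColumn("month", F.to_date("month"))

# Join on id and on "one month earlier than A.Date"
joined = a.join(
    b,
    (a["id"] == b["id"]) & (F.add_months(a["Date"], -1) == b["month"]),
    "left",
).select(a["id"], a["Date"], b["value"])
joined.show()
```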
PySpark: Adding elements from python list into spark.sql() statement
I have a list in Python that is used throughout my code: I also have a simple spark.sql() line that I need to execute: I want to replace the list of elements in the spark.sql() statement with the Python list, so that the last line in the SQL is I am aware of using {} and str.format but I am struggling
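A minimal sketch of one way to do the substitution with str.format, building the quoted IN (...) list from the Python list; the table and column names are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.createDataFrame([("a", 1), ("b", 2), ("d", 4)], ["code", "value"]) \
     .createOrReplaceTempView("my_table")

my_list = ["a", "b", "c"]

# Quote each element and join them: "'a', 'b', 'c'"
in_clause = ", ".join("'{}'".format(x) for x in my_list)

df = spark.sql(
    "SELECT * FROM my_table WHERE code IN ({})".format(in_clause)
)
df.show()
```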
Filter a Dataframe using a subset of it and two specific fields in spark/scala [closed]
Closed. This question needs debugging details and is not currently accepting answers. Closed 10 months ago. I have a Scala/Spark question. I’m using Spark 2.1.1. I have a Dataframe
Using a SQL query in Spark SQL – error in execution
When I try to execute this query in PySpark I get an error every time. I have looked everywhere but I can’t figure out why it doesn’t work; it would be great if someone could help me. The goal of this query is to update a new column, which I will create later, called temp_ok. This is my code: My table contains these columns: _temp_ok_calculer, Operator level
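Since the query itself is not shown, this is only a generic sketch of deriving a temp_ok column with when/otherwise rather than an UPDATE (which plain Spark DataFrames and temp views don’t support); the condition and the values of _temp_ok_calculer are assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "OK"), (2, "KO")],
    ["id", "_temp_ok_calculer"],
)

# Derive the new column rather than updating the table in place
df = df.withColumn(
    "temp_ok", F.when(F.col("_temp_ok_calculer") == "OK", 1).otherwise(0)
)
df.show()
```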
Group by range of dates from date_start to date_end columns
I have a table with the following structure: I want to count how many events (each row is an event) happened at each place in each month. If an event’s dates span several months, it should be counted for all affected months. place_id can be repeated, so I did the following query: So I get the following grouped table: The problem is
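One way to count an event in every month it touches (a sketch, Spark 2.4+): expand each row into the months between date_start and date_end with sequence + explode, then group. The table structure follows the description; the data is made up:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.createDataFrame(
    [(1, "2023-01-20", "2023-03-05"), (2, "2023-02-01", "2023-02-10")],
    ["place_id", "date_start", "date_end"],
).select(
    "place_id",
    F.to_date("date_start").alias("date_start"),
    F.to_date("date_end").alias("date_end"),
)

per_month = (
    events
    # One row per month between the start month and the end month of the event
    .withColumn(
        "month",
        F.explode(
            F.sequence(
                F.trunc("date_start", "month"),
                F.trunc("date_end", "month"),
                F.expr("interval 1 month"),
            )
        ),
    )
    .groupBy("place_id", "month")
    .count()
)
per_month.show()
```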
Select rows from a table which contain at least one alphabetic character in the column
I have a column called name in a table in Databricks. I want to find a way to select only those rows from the table which contain at least one alphabetic character in the name column. Example values in the column: Expected: I need to pick only those values which contain at least one letter. Or in other words, I
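A sketch of the kind of filter that does this, using rlike with a letter character class (the sample values are invented); the SQL equivalent would be WHERE name RLIKE '[A-Za-z]':

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("1234",), ("12ab",), ("hello",), ("!!",)],
    ["name"],
)

# Keep only rows whose name contains at least one letter
df.filter(F.col("name").rlike("[A-Za-z]")).show()
```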
Spark.sql Filter rows by MAX
Below is part of a source file which you could imagine being much bigger: After the following code: I would like to obtain this result: The aim is to: select the dates on which each cityname has the MAX total (note: a city can appear twice if it has the MAX total for 2 different dates), sort by total descending, then by date
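One way this can be done (a sketch over invented data matching the description): compute the per-city maximum with a window, keep the rows that reach it (so ties keep a city twice), then sort:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        ("London", "2020-01-01", 10),
        ("London", "2020-01-02", 25),
        ("Paris",  "2020-01-01", 25),
        ("Paris",  "2020-01-02", 25),
    ],
    ["cityname", "date", "total"],
)

w = Window.partitionBy("cityname")
result = (
    df.withColumn("max_total", F.max("total").over(w))
      .filter(F.col("total") == F.col("max_total"))  # keeps ties
      .drop("max_total")
      .orderBy(F.col("total").desc(), "date")
)
result.show()
```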