
Tag: apache-spark

How to avoid a self-join in Spark Scala

I have a DataFrame called product_relationship_current, and I’m doing a self-join to derive a new DataFrame from it. First I give it an alias so that I can treat it as two different DataFrames, and then I do the self-join to get the new DataFrame. I’m looking for another way to do this without a self-join, so I don’t …

Filter a Dataframe using a subset of it and two specific fields in spark/scala [closed]

I have a Scala/Spark question. I’m using Spark 2.1.1. I have a DataFrame …

Spark.sql Filter rows by MAX

Below is part of a source file, which you can imagine being much bigger. After running the following code, I would like to obtain this result. The aim is to select the dates on which each cityname has its MAX total (note that a city can appear twice if it reaches its MAX total on two different dates), then sort by total descending, then by date.
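
The original code and data are not shown in this excerpt, so the following is only a minimal sketch of one common way to approach the stated aim, using a window function in the DataFrame API rather than a spark.sql query. The column names date, cityname and total, the sample rows, and the object name MaxTotalPerCity are all assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max}

object MaxTotalPerCity {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("max-total-per-city")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data; the real source file is not shown in the excerpt.
    val df = Seq(
      ("2020-01-01", "Paris", 10),
      ("2020-01-02", "Paris", 10),
      ("2020-01-03", "Paris", 4),
      ("2020-01-01", "Lyon", 7),
      ("2020-01-02", "Lyon", 3)
    ).toDF("date", "cityname", "total")

    // Compute each city's maximum total alongside every row of that city.
    val byCity = Window.partitionBy("cityname")

    val result = df
      .withColumn("max_total", max(col("total")).over(byCity))
      // Keep every row that reaches the maximum, so ties on different dates survive.
      .filter(col("total") === col("max_total"))
      .drop("max_total")
      // Sort by total descending, then by date ascending.
      .orderBy(col("total").desc, col("date"))

    result.show()
    spark.stop()
  }
}
```

Comparing each row against the per-city maximum, instead of ranking rows and keeping only the first, is what lets a city appear more than once when it reaches its maximum on several dates.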