Tag: apache-spark

How to avoid a self-join in Spark Scala

I have a DataFrame called product_relationship_current, and I’m doing a self-join to retrieve a new DataFrame, like below: First, I’m giving it an alias so that I can treat it as two different DataFrames: And then I’m doing a self-join to get a new DataFrame: But I’m looking for anot…

Spark.sql Filter rows by MAX

Below is part of a source file, which you can imagine being much bigger: After the following code: I would like to obtain this result: The aim is to: Select the dates on which each cityname has the MAX total (note: a city can appear twice if it has the MAX total on 2 different dates), Sort by total descending, …