Tag: apache-spark-sql

Spark.sql Filter rows by MAX

Below is part of a source file, which you could imagine being much bigger: After the following code: I would like to obtain this result: The aim is to: select the dates on which each cityname has its MAX total (note: a city can appear twice if it has the MAX total on 2 different dates), then sort by total descending, then by date.
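The stated aim can be sketched with a window function. This is a minimal sketch, not the asker's code: the table name `city_totals` and the sample rows are hypothetical, and only the column names `cityname`, `date` and `total` come from the excerpt. It runs the query through Python's built-in `sqlite3` so that it works without a Spark session; the same query text would presumably be passed to `spark.sql(...)` against a registered view, which is an assumption not verified here.

```python
import sqlite3

# Hypothetical sample data; only the column names come from the question.
rows = [
    ("Paris", "2020-01-01", 10),
    ("Paris", "2020-01-02", 30),
    ("Paris", "2020-01-03", 30),  # ties with the previous row, so Paris appears twice
    ("Lyon", "2020-01-01", 20),
    ("Lyon", "2020-01-02", 5),
]

# Attach each city's maximum total to every row, then keep the rows that reach it.
QUERY = """
SELECT cityname, date, total
FROM (
    SELECT cityname, date, total,
           MAX(total) OVER (PARTITION BY cityname) AS max_total
    FROM city_totals
) AS ranked
WHERE total = max_total
ORDER BY total DESC, date
"""


def rows_with_max_total(data):
    """Load the rows into an in-memory table and return the per-city MAX rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE city_totals (cityname TEXT, date TEXT, total INTEGER)"
    )
    conn.executemany("INSERT INTO city_totals VALUES (?, ?, ?)", data)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


result = rows_with_max_total(rows)
for row in result:
    print(row)
```

Comparing `total` with a windowed `MAX` rather than using `ROW_NUMBER()` keeps every tied date, which matches the note that a city can appear twice.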

Spark: How to transpose and explode columns with dynamic nested arrays

I applied an algorithm from the question Spark: How to transpose and explode columns with nested arrays to transpose and explode a nested Spark dataframe with dynamic arrays. I have added """{"id":3,"c":[{"date":3,"val":3, "val_dynamic":3}]}}""" to the dataframe, with a new column c, where the array has a new val_dynamic field which can appear at random. I'm looking for required output 2 (Transpose and

Missing rows in full outer join

I am trying to count how many users are observed on each of 3 consecutive days. Each of the 3 intermediate tables (t0, t1, t2) has 2 columns: uid (unique ID) and d0 (or d1 or d2, which is 1 and indicates that the user is observed on that day). The following query: produces this output from spark.sql(q).toPandas().set_index(["d0","d1","d2"]): Two
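The query itself is not shown in this excerpt, so the following is not a diagnosis of it. It is a minimal sketch of a different technique that avoids chaining full outer joins: stack the three tables with `UNION ALL`, collapse to one row per `uid`, then count users per `(d0, d1, d2)` pattern. The table and column names follow the description above; the sample uids are hypothetical. It runs on Python's built-in `sqlite3` so it works without a Spark session; whether `spark.sql` accepts the same text unchanged is an assumption not tested here.

```python
import sqlite3

# Hypothetical observations: the uids seen on each of the three days.
day0 = [1, 2, 3, 6]
day1 = [2, 3, 4]
day2 = [3, 5]

# Stack the tables, keep one row per uid, then count users per day pattern.
QUERY = """
SELECT d0, d1, d2, COUNT(*) AS n_users
FROM (
    SELECT uid, MAX(d0) AS d0, MAX(d1) AS d1, MAX(d2) AS d2
    FROM (
        SELECT uid, d0, NULL AS d1, NULL AS d2 FROM t0
        UNION ALL
        SELECT uid, NULL, d1, NULL FROM t1
        UNION ALL
        SELECT uid, NULL, NULL, d2 FROM t2
    ) AS stacked
    GROUP BY uid
) AS per_user
GROUP BY d0, d1, d2
"""


def count_patterns(uids0, uids1, uids2):
    """Return {(d0, d1, d2): number of users}, with None for 'not observed'."""
    conn = sqlite3.connect(":memory:")
    for name, col, uids in (("t0", "d0", uids0), ("t1", "d1", uids1), ("t2", "d2", uids2)):
        conn.execute(f"CREATE TABLE {name} (uid INTEGER, {col} INTEGER)")
        # Each table stores a 1 flag for every uid observed on that day.
        conn.executemany(f"INSERT INTO {name} VALUES (?, 1)", [(u,) for u in uids])
    counts = {(d0, d1, d2): n for d0, d1, d2, n in conn.execute(QUERY)}
    conn.close()
    return counts


counts = count_patterns(day0, day1, day2)
for pattern, n in counts.items():
    print(pattern, n)
```

Because every user ends up in exactly one pattern, the counts should sum to the number of distinct uids across the three tables, which gives a quick check for missing rows.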