
GROUP BY with overlapping rows in PySpark SQL

The following table was created using Parquet / PySpark, and the objective is to aggregate rows where 1 < count < 5 and rows where 2 < count < 6. Note that the row where count is 4.1 falls in both ranges.
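A minimal example table consistent with the description (the name column and the other values are assumptions):

```
name  count
a       1.5
b       2.5
c       4.1
d       5.5
```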

Here is code to create and then read the above table as a PySpark DataFrame.
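A minimal sketch, assuming the example rows above and a throwaway Parquet path /tmp/events:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("overlapping-ranges").getOrCreate()

# Write the example rows out as Parquet, then read them back.
spark.createDataFrame(
    [("a", 1.5), ("b", 2.5), ("c", 4.1), ("d", 5.5)],
    ["name", "count"],
).write.mode("overwrite").parquet("/tmp/events")

df = spark.read.parquet("/tmp/events")
df.createOrReplaceTempView("events")  # expose the table to spark.sql
```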

The aggregation can be done with two separate queries.
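A sketch of the first query, for 1 < count < 5; COUNT(*) stands in for whatever aggregate the original used:

```python
# Backticks keep the column `count` distinct from the COUNT(*) aggregate.
spark.sql("""
    SELECT COUNT(*) AS n
    FROM events
    WHERE `count` > 1 AND `count` < 5
""").show()
# n = 3 for the example data (rows 1.5, 2.5, 4.1)
```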

and, similarly, for 2 < count < 6:
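```python
spark.sql("""
    SELECT COUNT(*) AS n
    FROM events
    WHERE `count` > 2 AND `count` < 6
""").show()
# n = 3 for the example data (rows 2.5, 4.1, 5.5)
```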

However, I want to do this in one efficient query. Here is an approach that does not work, because the row where count is 4.1 ends up in only one group.
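The query was presumably a single GROUP BY over a CASE expression; a sketch of that shape (the bucket labels are assumptions):

```python
# A CASE expression assigns each row to exactly one bucket, so the
# 4.1 row matches the first WHEN and never reaches range2.
spark.sql("""
    SELECT CASE
             WHEN `count` > 1 AND `count` < 5 THEN 'range1'
             WHEN `count` > 2 AND `count` < 6 THEN 'range2'
           END AS bucket,
           COUNT(*) AS n
    FROM events
    GROUP BY 1
""").show()
```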

The above query produces:
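With the example data, the output looks like this (row order may vary):

```
+------+---+
|bucket|  n|
+------+---+
|range1|  3|
|range2|  1|
+------+---+
```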

To be clear, the desired result (again in terms of the example data) is something more like:
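```
+------+---+
|bucket|  n|
+------+---+
|range1|  3|
|range2|  3|
+------+---+
```

with the 4.1 row counted in both buckets.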


Answer

The simplest method is probably UNION ALL:
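A sketch against the events view from above:

```python
# Each SELECT filters one range independently, so the 4.1 row is
# counted in both; UNION ALL then keeps both result rows.
spark.sql("""
    SELECT 'range1' AS bucket, COUNT(*) AS n
    FROM events
    WHERE `count` > 1 AND `count` < 5
    UNION ALL
    SELECT 'range2' AS bucket, COUNT(*) AS n
    FROM events
    WHERE `count` > 2 AND `count` < 6
""").show()
```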

You can also phrase this as:
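One common phrasing (an assumption here, since the original snippet is not shown) joins the rows to a small derived table of range bounds, so each row lands in every range it overlaps:

```python
# Each row joins to every range whose bounds it satisfies, so
# overlapping rows are duplicated before the GROUP BY.
spark.sql("""
    SELECT r.bucket, COUNT(*) AS n
    FROM events e
    JOIN (
        SELECT 'range1' AS bucket, 1 AS lo, 5 AS hi
        UNION ALL
        SELECT 'range2' AS bucket, 2 AS lo, 6 AS hi
    ) r
      ON e.`count` > r.lo AND e.`count` < r.hi
    GROUP BY r.bucket
""").show()
```

This phrasing scans events only once, at the cost of a small join.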
