How to aggregate on multiple columns using SQL or spark SQL

I have the following table:

Id col1 col2
1  a    1   
1  b    2   
1  c    3   
2  a    1   
2  e    3   
2  f    4  

Expected output is:

Id col3
1  a1b2c3
2  a1e3f4

The aggregation involves two columns; is this supported in SQL?


Answer

In Spark SQL you can do it like this:

SELECT Id, aggregate(list, '', (acc, x) -> concat(acc, x)) col3
FROM (SELECT Id, array_sort(collect_list(concat(col1, col2))) list
      FROM df
      GROUP BY Id )

or in a single SELECT:

SELECT Id, aggregate(array_sort(collect_list(concat(col1, col2))), '', (acc, x) -> concat(acc, x)) col3
FROM df
GROUP BY Id
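To check the logic outside Spark, the same group, sort, and concatenate steps can be sketched in plain Python on the sample data (the names below are illustrative, not Spark APIs):

```python
from collections import defaultdict

# The sample table: (Id, col1, col2)
rows = [(1, "a", 1), (1, "b", 2), (1, "c", 3),
        (2, "a", 1), (2, "e", 3), (2, "f", 4)]

# collect_list(concat(col1, col2)) grouped by Id
groups = defaultdict(list)
for id_, c1, c2 in rows:
    groups[id_].append(f"{c1}{c2}")

# array_sort, then fold the array into a single string
result = {id_: "".join(sorted(vals)) for id_, vals in groups.items()}
print(result)  # {1: 'a1b2c3', 2: 'a1e3f4'}
```

This reproduces the expected output above, with `sorted` standing in for `array_sort` and `"".join` standing in for the `aggregate` fold.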

This example uses the higher-order function aggregate:

aggregate(expr, start, merge, finish) – Applies a binary operator to an initial state and all elements in the array, and reduces this to a single state. The final state is converted into the final result by applying a finish function.
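The fold that `aggregate` performs can be mirrored with `functools.reduce`; this is a sketch of the semantics for one group, not Spark itself:

```python
from functools import reduce

arr = ["a1", "b2", "c3"]          # array_sort(collect_list(...)) for Id = 1
start = ""                        # the initial state ''
merge = lambda acc, x: acc + x    # the lambda (acc, x) -> concat(acc, x)

col3 = reduce(merge, arr, start)
print(col3)  # a1b2c3
```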

User contributions licensed under: CC BY-SA