I have following table:
Id col1 col2 1 a 1 1 b 2 1 c 3 2 a 1 2 e 3 2 f 4
Expected output is:
Id col3 1 a1b2c3 2 a1e3f4
The aggregation computation involves 2 columns, is this supported in SQL?
Advertisement
Answer
In Spark SQL you can do it like this:
SELECT Id, aggregate(list, '', (acc, x) -> concat(acc, x)) col3 FROM (SELECT Id, array_sort(collect_list(concat(col1, col2))) list FROM df GROUP BY Id )
or in one select:
SELECT Id, aggregate(array_sort(collect_list(concat(col1, col2))), '', (acc, x) -> concat(acc, x)) col3 FROM df GROUP BY Id
Higher-order aggregate
function is used in this example.
aggregate(expr, start, merge, finish) – Applies a binary operator to an initial state and all elements in the array, and reduces this to a single state. The final state is converted into the final result by applying a finish function.