
Aggregate data from multiple rows to one and then nest the data

I’m relatively new to Scala and Spark programming.

I have a use case where I need to group data by certain columns, count the values of another column (using a pivot), and finally create a nested DataFrame out of my flat DataFrame.

One major challenge I am facing is that I also need to retain certain other columns (not the ones I am grouping or pivoting on).

I’m not able to figure out an efficient way to do it.

INPUT

Now say I want to pivot on ‘country’ and group by (‘ID’, ‘ID2’, ‘ID3’), but I also want to maintain the other columns as a list.

For instance,

OUTPUT-1 :
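Since the input and output tables are not shown, here is a rough sketch of that step under assumptions: `other1` and `other2` are hypothetical stand-ins for the extra columns to retain, and the sample rows are made up. The idea is to pivot for the per-country counts, aggregate the other columns into lists in a separate pass, and join the two results back on the grouping keys:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("pivot-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical flat input; "other1"/"other2" stand in for the columns to retain.
val df = Seq(
  ("a", "b", "c", "US", "x1", "y1"),
  ("a", "b", "c", "US", "x2", "y2"),
  ("a", "b", "c", "IN", "x3", "y3")
).toDF("ID", "ID2", "ID3", "country", "other1", "other2")

val keys = Seq("ID", "ID2", "ID3")

// Per-country counts via pivot: one column per distinct country value.
val counts = df.groupBy(keys.head, keys.tail: _*).pivot("country").count()

// The other columns, preserved as lists per group (not per pivot value).
val others = df.groupBy(keys.head, keys.tail: _*)
  .agg(collect_list("other1").as("other1_list"),
       collect_list("other2").as("other2_list"))

// Join on the grouping keys so both the counts and the lists survive.
val result = counts.join(others, keys)
```

Doing the `collect_list` aggregation separately (rather than inside the pivot) keeps the lists global per group; an aggregate placed inside `.pivot(...).agg(...)` would instead be computed once per country column.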

Once I achieve this, I want to nest it into a nested structure so that my schema looks like:

I believe I can use a case class and then map every row of the DataFrame to it. However, I am not sure if that is an efficient way. I would love to know if there is a more optimised way to achieve this.

What I have in mind is something along these lines:

and so on…
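As an alternative to the case-class mapping, the nesting can be sketched with the built-in `struct` function in a single `select`, which stays in the DataFrame API and avoids per-row deserialization. The column and field names below are hypothetical, since the target schema is not shown:

```scala
import org.apache.spark.sql.functions._

// Sketch: fold flat columns into a nested schema with struct().
// "result" is assumed to be the pivoted DataFrame from the previous step;
// the field names are placeholders for the real schema.
val nested = result.select(
  struct(col("ID"), col("ID2"), col("ID3")).as("keys"),
  struct(col("US"), col("IN")).as("country_counts"),
  struct(col("other1_list"), col("other2_list")).as("others")
)
nested.printSchema()
```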


Answer

I think this should work for your case. The number of partitions is calculated from the formula partitions_num = data_size / 500 MB.
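A rough sketch of applying that formula: the 500 MB target per partition is from the formula above, while the input size here is a made-up figure; in practice it would come from the file system or from Spark's own statistics:

```scala
// partitions_num = data_size / 500 MB, per the formula above.
val dataSizeBytes = 20L * 1024 * 1024 * 1024       // hypothetical input size: 20 GB
val targetPartitionBytes = 500L * 1024 * 1024      // 500 MB per partition
val partitionsNum =
  math.max(1, math.ceil(dataSizeBytes.toDouble / targetPartitionBytes).toInt)

// Spread the data across the computed number of partitions before the heavy work.
val repartitioned = df.repartition(partitionsNum)
```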

Good luck and let me know if you need any clarification.

User contributions licensed under: CC BY-SA