
Create rows from columns in an Apache Spark Dataset

I’m trying to create rows from existing columns in a Dataset. Here is my case:

InputDataset

accountid           payingaccountid     billedaccountid     startdate                   enddate
0011t00000MY1U3AAL  0011t00000MY1U3XXX  0011t00000ZZ1U3AAL  2020-06-10 00:00:00.000000  NULL

And I would like to have something like this:

accountid           startdate                   enddate
0011t00000MY1U3AAL  2020-06-10 00:00:00.000000  NULL
0011t00000MY1U3XXX  2020-06-10 00:00:00.000000  NULL
0011t00000ZZ1U3AAL  2020-06-10 00:00:00.000000  NULL

In other words, the values from the payingaccountid and billedaccountid columns of the input should also appear in the accountid column of the output.

Thank you in advance.


Answer

You can create an array column from the three id columns, then use the explode function to get the desired result.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._

val df = List(("0011t00000MY1U3AAL", "0011t00000MY1U3XXX", "0011t00000ZZ1U3AAL", "2020-06-10 00:00:00.000000", "NULL"))
    .toDF("accountid", "payingaccountid", "billedaccountid", "startdate", "enddate")

df.select(
    explode_outer(array("accountid","payingaccountid","billedaccountid")).as("accountid"),
    'startdate,'enddate).show(false)
/*
+------------------+--------------------------+-------+
|accountid         |startdate                 |enddate|
+------------------+--------------------------+-------+
|0011t00000MY1U3AAL|2020-06-10 00:00:00.000000|NULL   |
|0011t00000MY1U3XXX|2020-06-10 00:00:00.000000|NULL   |
|0011t00000ZZ1U3AAL|2020-06-10 00:00:00.000000|NULL   |
+------------------+--------------------------+-------+ */
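As an alternative sketch (not from the original answer), the same unpivot can be written with Spark SQL’s stack function, which emits one row per listed expression without building an intermediate array column:

```scala
// stack(3, a, b, c) generates 3 rows, each carrying one of the id values;
// the other columns are simply repeated on every generated row.
df.selectExpr(
    "stack(3, accountid, payingaccountid, billedaccountid) as accountid",
    "startdate",
    "enddate"
).show(false)
```

Note that explode_outer (used above) keeps a row even when the array is null or empty, whereas stack always produces exactly as many rows as expressions listed; for three non-null id columns the two approaches give the same result.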
User contributions licensed under: CC BY-SA