I’m trying to create multiple rows from existing columns in a dataset. Here is my case:
Input dataset:
accountid | payingaccountid | billedaccountid | startdate | enddate |
---|---|---|---|---|
0011t00000MY1U3AAL | 0011t00000MY1U3XXX | 0011t00000ZZ1U3AAL | 2020-06-10 00:00:00.000000 | NULL |
And I would like to have something like this:
accountid | startdate | enddate |
---|---|---|
0011t00000MY1U3AAL | 2020-06-10 00:00:00.000000 | NULL |
0011t00000MY1U3XXX | 2020-06-10 00:00:00.000000 | NULL |
0011t00000ZZ1U3AAL | 2020-06-10 00:00:00.000000 | NULL |
In other words, the values from the `billedaccountid` and `payingaccountid` columns should now also appear in the `accountid` column.
Thank you in advance.
Answer
You can pack the three id columns into an array column, then use the `explode` function to get the desired result.
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._

val df = List(
  ("0011t00000MY1U3AAL", "0011t00000MY1U3XXX", "0011t00000ZZ1U3AAL",
   "2020-06-10 00:00:00.000000", "NULL")
).toDF("accountid", "payingaccountid", "billedaccountid", "startdate", "enddate")

// Pack the three id columns into an array, then explode it so each id
// becomes its own row; explode_outer also keeps rows whose array is null.
df.select(
  explode_outer(array("accountid", "payingaccountid", "billedaccountid")).as("accountid"),
  $"startdate",
  $"enddate"
).show(false)

/*
+------------------+--------------------------+-------+
|accountid         |startdate                 |enddate|
+------------------+--------------------------+-------+
|0011t00000MY1U3AAL|2020-06-10 00:00:00.000000|NULL   |
|0011t00000MY1U3XXX|2020-06-10 00:00:00.000000|NULL   |
|0011t00000ZZ1U3AAL|2020-06-10 00:00:00.000000|NULL   |
+------------------+--------------------------+-------+
*/
```
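For intuition, here is a plain-Python sketch (not Spark, and the function name `explode_ids` is my own) of what exploding the three-column array does to each row: one input row fans out into one output row per id column, with the remaining columns repeated.

```python
def explode_ids(row, id_cols, keep_cols):
    """Yield one dict per id column, copying the kept columns unchanged."""
    for col in id_cols:
        out = {"accountid": row[col]}  # each id value becomes its own row
        for k in keep_cols:
            out[k] = row[k]            # startdate/enddate are repeated
        yield out

row = {
    "accountid": "0011t00000MY1U3AAL",
    "payingaccountid": "0011t00000MY1U3XXX",
    "billedaccountid": "0011t00000ZZ1U3AAL",
    "startdate": "2020-06-10 00:00:00.000000",
    "enddate": None,
}

rows = list(explode_ids(
    row,
    ["accountid", "payingaccountid", "billedaccountid"],
    ["startdate", "enddate"],
))
# One input row has become three output rows, matching the Spark output above.
```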