I have a table like this,I am doing normal pivoting,which is not giving the desired result.
x
+-----------------------------------------------------------+-----------------------------------------------------------------------------
|type |column_ref |table_object |
+-----------------------------------------------------------+-----------------------------------------------------------------------------
foreignKeyColumn FRED.FRED.BACHELOR_DEGREE_OR_HIGHER.REGION_CODE FRED.FRED.US_REGIONS |
primaryKeyColumn FRED.FRED.US_REGIONS.REGION_CODE FRED.FRED.US_REGIONS |
foreignKeyColumn FRED.FRED.MEAN_REAL_WAGES_COLA.REGION_CODE FRED.FRED.US_REGIONS |
primaryKeyColumn FRED.FRED.US_REGIONS.REGION_CODE FRED.FRED.US_REGIONS |
foreignKeyColumn FRED.FRED.PER_CAPITA_PERSONAL_INCOME.REGION_CODE FRED.FRED.US_REGIONS |
primaryKeyColumn FRED.FRED.US_REGIONS.REGION_CODE FRED.FRED.US_REGIONS |
foreignKeyColumn FRED.FRED.HOMEOWNERSHIP_RATE.REGION_CODE FRED.FRED.US_REGIONS |
primaryKeyColumn FRED.FRED.US_REGIONS.REGION_CODE FRED.FRED.US_REGIONS |
+-----------------------------------------------------------+-----------------------------------------------------------------------------
I want to get it in this way:
+-----------------------------------------------------------+-----------------------------------------------------------------------------
|foreignKeyColumn |primaryKeyColumn |table_object |
+-----------------------------------------------------------+-----------------------------------------------------------------------------
FRED.FRED.BACHELOR_DEGREE_OR_HIGHER.REGION_CODE FRED.FRED.US_REGIONS.REGION_CODE FRED.FRED.US_REGIONS
FRED.FRED.MEAN_REAL_WAGES_COLA.REGION_CODE FRED.FRED.US_REGIONS.REGION_CODE FRED.FRED.US_REGIONS
FRED.FRED.PER_CAPITA_PERSONAL_INCOME.REGION_CODE FRED.FRED.US_REGIONS.REGION_CODE FRED.FRED.US_REGIONS
FRED.FRED.HOMEOWNERSHIP_RATE.REGION_CODE FRED.FRED.US_REGIONS.REGION_CODE FRED.FRED.US_REGIONS
I tried doing it like this :
val grouped = df.groupBy("table_object").pivot("type").agg(first("column_ref"))
But its not giving as expected. Can any one suggest what can be done for the desired result ?
Normal SQL / DF solution both will be helpful.
Advertisement
Answer
Perhaps this is helpful-
pivot
+ arrays_zip
df2.show(false)
df2.printSchema()
/**
* +----------------+------------------------------------------------+--------------------+
* |type |column_ref |table_object |
* +----------------+------------------------------------------------+--------------------+
* |foreignKeyColumn|FRED.FRED.BACHELOR_DEGREE_OR_HIGHER.REGION_CODE |FRED.FRED.US_REGIONS|
* |primaryKeyColumn|FRED.FRED.US_REGIONS.REGION_CODE |FRED.FRED.US_REGIONS|
* |foreignKeyColumn|FRED.FRED.MEAN_REAL_WAGES_COLA.REGION_CODE |FRED.FRED.US_REGIONS|
* |primaryKeyColumn|FRED.FRED.US_REGIONS.REGION_CODE |FRED.FRED.US_REGIONS|
* |foreignKeyColumn|FRED.FRED.PER_CAPITA_PERSONAL_INCOME.REGION_CODE|FRED.FRED.US_REGIONS|
* |primaryKeyColumn|FRED.FRED.US_REGIONS.REGION_CODE |FRED.FRED.US_REGIONS|
* |foreignKeyColumn|FRED.FRED.HOMEOWNERSHIP_RATE.REGION_CODE |FRED.FRED.US_REGIONS|
* |primaryKeyColumn|FRED.FRED.US_REGIONS.REGION_CODE |FRED.FRED.US_REGIONS|
* +----------------+------------------------------------------------+--------------------+
*
* root
* |-- type: string (nullable = true)
* |-- column_ref: string (nullable = true)
* |-- table_object: string (nullable = true)
*/
val p = df2
.groupBy("table_object")
.pivot("type")
.agg(collect_list("column_ref"))
p
.withColumn("x", arrays_zip($"foreignKeyColumn", $"primaryKeyColumn"))
.selectExpr("table_object", "inline_outer(x)" )
.show(false)
/**
* +--------------------+------------------------------------------------+--------------------------------+
* |table_object |foreignKeyColumn |primaryKeyColumn |
* +--------------------+------------------------------------------------+--------------------------------+
* |FRED.FRED.US_REGIONS|FRED.FRED.BACHELOR_DEGREE_OR_HIGHER.REGION_CODE |FRED.FRED.US_REGIONS.REGION_CODE|
* |FRED.FRED.US_REGIONS|FRED.FRED.MEAN_REAL_WAGES_COLA.REGION_CODE |FRED.FRED.US_REGIONS.REGION_CODE|
* |FRED.FRED.US_REGIONS|FRED.FRED.PER_CAPITA_PERSONAL_INCOME.REGION_CODE|FRED.FRED.US_REGIONS.REGION_CODE|
* |FRED.FRED.US_REGIONS|FRED.FRED.HOMEOWNERSHIP_RATE.REGION_CODE |FRED.FRED.US_REGIONS.REGION_CODE|
* +--------------------+------------------------------------------------+--------------------------------+
*/