Skip to content
Advertisement

Pivot rows in a different way using MySQL or SparkDataframe

I have a table like this,I am doing normal pivoting,which is not giving the desired result.

+-----------------------------------------------------------+-----------------------------------------------------------------------------
|type                                                       |column_ref                                            |table_object         |
+-----------------------------------------------------------+-----------------------------------------------------------------------------
foreignKeyColumn                                             FRED.FRED.BACHELOR_DEGREE_OR_HIGHER.REGION_CODE       FRED.FRED.US_REGIONS  |
primaryKeyColumn                                             FRED.FRED.US_REGIONS.REGION_CODE                      FRED.FRED.US_REGIONS  |

foreignKeyColumn                                             FRED.FRED.MEAN_REAL_WAGES_COLA.REGION_CODE            FRED.FRED.US_REGIONS  |
primaryKeyColumn                                             FRED.FRED.US_REGIONS.REGION_CODE                      FRED.FRED.US_REGIONS  |

foreignKeyColumn                                             FRED.FRED.PER_CAPITA_PERSONAL_INCOME.REGION_CODE      FRED.FRED.US_REGIONS  |
primaryKeyColumn                                             FRED.FRED.US_REGIONS.REGION_CODE                      FRED.FRED.US_REGIONS  |

foreignKeyColumn                                             FRED.FRED.HOMEOWNERSHIP_RATE.REGION_CODE              FRED.FRED.US_REGIONS  |
primaryKeyColumn                                             FRED.FRED.US_REGIONS.REGION_CODE                      FRED.FRED.US_REGIONS  |
+-----------------------------------------------------------+-----------------------------------------------------------------------------

I want to get it in this way:

+-----------------------------------------------------------+-----------------------------------------------------------------------------
|foreignKeyColumn                                           |primaryKeyColumn                                            |table_object   |
+-----------------------------------------------------------+-----------------------------------------------------------------------------
FRED.FRED.BACHELOR_DEGREE_OR_HIGHER.REGION_CODE             FRED.FRED.US_REGIONS.REGION_CODE                              FRED.FRED.US_REGIONS
FRED.FRED.MEAN_REAL_WAGES_COLA.REGION_CODE                  FRED.FRED.US_REGIONS.REGION_CODE                              FRED.FRED.US_REGIONS
FRED.FRED.PER_CAPITA_PERSONAL_INCOME.REGION_CODE            FRED.FRED.US_REGIONS.REGION_CODE                              FRED.FRED.US_REGIONS
FRED.FRED.HOMEOWNERSHIP_RATE.REGION_CODE                    FRED.FRED.US_REGIONS.REGION_CODE                              FRED.FRED.US_REGIONS

I tried doing it like this :

  val grouped = df.groupBy("table_object").pivot("type").agg(first("column_ref"))

But its not giving as expected. Can any one suggest what can be done for the desired result ?

Normal SQL / DF solution both will be helpful.

Advertisement

Answer

Perhaps this is helpful-

pivot + arrays_zip

df2.show(false)
    df2.printSchema()
    /**
      * +----------------+------------------------------------------------+--------------------+
      * |type            |column_ref                                      |table_object        |
      * +----------------+------------------------------------------------+--------------------+
      * |foreignKeyColumn|FRED.FRED.BACHELOR_DEGREE_OR_HIGHER.REGION_CODE |FRED.FRED.US_REGIONS|
      * |primaryKeyColumn|FRED.FRED.US_REGIONS.REGION_CODE                |FRED.FRED.US_REGIONS|
      * |foreignKeyColumn|FRED.FRED.MEAN_REAL_WAGES_COLA.REGION_CODE      |FRED.FRED.US_REGIONS|
      * |primaryKeyColumn|FRED.FRED.US_REGIONS.REGION_CODE                |FRED.FRED.US_REGIONS|
      * |foreignKeyColumn|FRED.FRED.PER_CAPITA_PERSONAL_INCOME.REGION_CODE|FRED.FRED.US_REGIONS|
      * |primaryKeyColumn|FRED.FRED.US_REGIONS.REGION_CODE                |FRED.FRED.US_REGIONS|
      * |foreignKeyColumn|FRED.FRED.HOMEOWNERSHIP_RATE.REGION_CODE        |FRED.FRED.US_REGIONS|
      * |primaryKeyColumn|FRED.FRED.US_REGIONS.REGION_CODE                |FRED.FRED.US_REGIONS|
      * +----------------+------------------------------------------------+--------------------+
      *
      * root
      * |-- type: string (nullable = true)
      * |-- column_ref: string (nullable = true)
      * |-- table_object: string (nullable = true)
      */
    val p = df2
      .groupBy("table_object")
      .pivot("type")
      .agg(collect_list("column_ref"))
    
    p
      .withColumn("x", arrays_zip($"foreignKeyColumn", $"primaryKeyColumn"))
      .selectExpr("table_object", "inline_outer(x)" )
      .show(false)

    /**
      * +--------------------+------------------------------------------------+--------------------------------+
      * |table_object        |foreignKeyColumn                                |primaryKeyColumn                |
      * +--------------------+------------------------------------------------+--------------------------------+
      * |FRED.FRED.US_REGIONS|FRED.FRED.BACHELOR_DEGREE_OR_HIGHER.REGION_CODE |FRED.FRED.US_REGIONS.REGION_CODE|
      * |FRED.FRED.US_REGIONS|FRED.FRED.MEAN_REAL_WAGES_COLA.REGION_CODE      |FRED.FRED.US_REGIONS.REGION_CODE|
      * |FRED.FRED.US_REGIONS|FRED.FRED.PER_CAPITA_PERSONAL_INCOME.REGION_CODE|FRED.FRED.US_REGIONS.REGION_CODE|
      * |FRED.FRED.US_REGIONS|FRED.FRED.HOMEOWNERSHIP_RATE.REGION_CODE        |FRED.FRED.US_REGIONS.REGION_CODE|
      * +--------------------+------------------------------------------------+--------------------------------+
      */
User contributions licensed under: CC BY-SA
7 People found this is helpful
Advertisement