SparkSQL query dataframe

Question

I am converted a pandas dataframe into spark sql table. I am new to SQL and would like to select the key 'code' from table. query sqlContext.sql("""SELECT `classification` FROM psyc""").show() ...

Accepted Answer

The field classification is of type string so first you have to convert it to struct type, after that you can directly select as classification.code. To convert from string to struct try below.//Sample Dataframefrom pyspark.sql.types import *df=spark.createDataFrame([(1,"[{'code':'1234','name':'manoj'},{'code':'124','name':'kumar'},{'code':'4567','name':'dhakad'}]",),(2,"[{'code':'97248','name':'joe'},{'code':'2424','name':'alice'},{'code':'464','name':'bob'}]",)],["id","classification",])//df will be below+---+--------------------+| id|      classification|+---+--------------------+|  1|[{'code':'1234','...||  2|[{'code':'97248',...|+---+--------------------+//here is schema of above dfroot    |-- id: long (nullable = true)    |-- classification: string (nullable = true) //df after converting classification column to the struct type and selecting only code.schema = ArrayType(StructType([StructField('code', StringType()), StructField('name', StringType())]))df1=df.withColumn('classification',from_json(col("classification"),schema=schema))df2=df1.withColumn("code",col("classification.code"))+---+--------------------+------------------+| id|      classification|              code|+---+--------------------+------------------+|  1|[[1234,manoj], [1...| [1234, 124, 4567]||  2|[[97248,joe], [24...|[97248, 2424, 464]|+---+--------------------+------------------+//Here, I am going to select id and while exploding code columndf3=df2.select(col("id"),explode(col("code")))df3.show()//df3 output+---+-----+| id|  col|+---+-----+|  1| 1234||  1|  124||  1| 4567||  2|97248||  2| 2424||  2|  464|+---+-----+

Advertisement

Answer