I am converted a pandas dataframe into spark sql table. I am new to SQL and would like to select the key ‘code’ from table.
query
x
sqlContext.sql("""SELECT `classification` FROM psyc""").show()
query response
+--------------------+
| classification|
+--------------------+
|[{'code': '3297',|
|[{'code': '3410',|
|[{'code': '3410',|
|[{'code': '2227',|
|[{'code': '3410',|
+--------------------+
How can I select the key ‘code’. The column contains a list of dict which contain the data.
sqlContext.sql("""SELECT `classification.code` FROM psyc""").show() # this query does not work
Here is the rest of the code
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
spark = SparkSession
.builder
.appName("Python Spark SQL ")
.getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)
fp = os.path.join(BASE_DIR,'psyc.csv')
df = spark.read.csv(fp,header=True)
df.printSchema()
df.createOrReplaceTempView("psyc")
This creates a table with following schema
Advertisement
Answer
The field classification
is of type string so first you have to convert it to struct type, after that you can directly select as classification.code
. To convert from string to struct try below.
//Sample Dataframe
from pyspark.sql.types import *
df=spark.createDataFrame([(1,"[{'code':'1234','name':'manoj'},{'code':'124','name':'kumar'},{'code':'4567','name':'dhakad'}]",),(2,"[{'code':'97248','name':'joe'},{'code':'2424','name':'alice'},{'code':'464','name':'bob'}]",)],["id","classification",])
//df will be below
+---+--------------------+
| id| classification|
+---+--------------------+
| 1|[{'code':'1234','...|
| 2|[{'code':'97248',|
+---+--------------------+
//here is schema of above df
root
|-- id: long (nullable = true)
|-- classification: string (nullable = true)
//df after converting classification column to the struct type and selecting only code.
schema = ArrayType(StructType([StructField('code', StringType()), StructField('name', StringType())]))
df1=df.withColumn('classification',from_json(col("classification"),schema=schema))
df2=df1.withColumn("code",col("classification.code"))
+---+--------------------+------------------+
| id| classification| code|
+---+--------------------+------------------+
| 1|[[1234,manoj], [1...| [1234, 124, 4567]|
| 2|[[97248,joe], [24...|[97248, 2424, 464]|
+---+--------------------+------------------+
//Here, I am going to select id and while exploding code column
df3=df2.select(col("id"),explode(col("code")))
df3.show()
//df3 output
+---+-----+
| id| col|
+---+-----+
| 1| 1234|
| 1| 124|
| 1| 4567|
| 2|97248|
| 2| 2424|
| 2| 464|
+---+-----+