I have converted a pandas dataframe into a Spark SQL table. I am new to SQL and would like to select the key 'code' from the table.
query
sqlContext.sql("""SELECT `classification` FROM psyc""").show()
query response
+--------------------+
|      classification|
+--------------------+
|[{'code': '3297',...|
|[{'code': '3410',...|
|[{'code': '3410',...|
|[{'code': '2227',...|
|[{'code': '3410',...|
+--------------------+
How can I select the key 'code'? The column contains a list of dicts which hold the data.
sqlContext.sql("""SELECT `classification.code` FROM psyc""").show() # this query does not work
Here is the rest of the code
import os

from pyspark.sql import SparkSession
from pyspark.sql import SQLContext

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL ") \
    .getOrCreate()

sc = spark.sparkContext
sqlContext = SQLContext(sc)

fp = os.path.join(BASE_DIR, 'psyc.csv')  # BASE_DIR is defined elsewhere
df = spark.read.csv(fp, header=True)
df.printSchema()
df.createOrReplaceTempView("psyc")
This creates the psyc table; its schema is printed by the df.printSchema() call above.
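Since spark.read.csv is called without an explicit schema or inferSchema, every column, including classification, is read as a plain string. A quick way to confirm this for the column in question (a sketch, assuming the psyc view registered above):

# Sketch: check how the classification column was typed by spark.read.csv
spark.table("psyc").select("classification").printSchema()
# expected to report classification as: string (nullable = true)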
Answer
The field classification is of type string, so first you have to convert it to a struct type; after that you can directly select it as classification.code. To convert from string to struct, try the code below.
# Sample dataframe
from pyspark.sql.types import *
from pyspark.sql.functions import col, from_json, explode

df = spark.createDataFrame(
    [(1, "[{'code':'1234','name':'manoj'},{'code':'124','name':'kumar'},{'code':'4567','name':'dhakad'}]"),
     (2, "[{'code':'97248','name':'joe'},{'code':'2424','name':'alice'},{'code':'464','name':'bob'}]")],
    ["id", "classification"])

# df will be below
+---+--------------------+
| id|      classification|
+---+--------------------+
|  1|[{'code':'1234','...|
|  2|[{'code':'97248',...|
+---+--------------------+

# here is the schema of the above df
root
 |-- id: long (nullable = true)
 |-- classification: string (nullable = true)

# df after converting the classification column to the struct type and selecting only code
schema = ArrayType(StructType([StructField('code', StringType()), StructField('name', StringType())]))
df1 = df.withColumn('classification', from_json(col("classification"), schema=schema))
df2 = df1.withColumn("code", col("classification.code"))

+---+--------------------+------------------+
| id|      classification|              code|
+---+--------------------+------------------+
|  1|[[1234,manoj], [1...| [1234, 124, 4567]|
|  2|[[97248,joe], [24...|[97248, 2424, 464]|
+---+--------------------+------------------+

# Here, I am going to select id while exploding the code column
df3 = df2.select(col("id"), explode(col("code")))
df3.show()

# df3 output
+---+-----+
| id|  col|
+---+-----+
|  1| 1234|
|  1|  124|
|  1| 4567|
|  2|97248|
|  2| 2424|
|  2|  464|
+---+-----+
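Applied back to the original psyc view, a minimal sketch of the same idea (assuming the classification column in the CSV has the same {'code': ..., 'name': ...} shape; Spark's JSON parser accepts the single-quoted values by default) is to parse the column, register a new view, and then run the SELECT from the question:

# Sketch: parse the string column of the psyc view and query it with SQL
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

schema = ArrayType(StructType([StructField('code', StringType()),
                               StructField('name', StringType())]))

parsed = spark.table("psyc").withColumn(
    "classification", from_json(col("classification"), schema))
parsed.createOrReplaceTempView("psyc_parsed")

# classification is now an array of structs, so the dotted field access works in SQL.
# Note: do not wrap the whole path in backticks, otherwise Spark looks for a single
# column literally named "classification.code".
sqlContext.sql("SELECT classification.code FROM psyc_parsed").show()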