Below is my script to use sql in dataframe with python:
pyspark --packages com.databricks:spark-csv_2.10:1.4.0 from pyspark.sql import SQLContext sqlContext = SQLContext(sc) df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('file:///root/Downloads/data/flight201601short.csv')
df.show(5) shows result below:
then I register the dataframe to a temp table:
df.registerTempTable("flight201601")
and tried to run some sql query like below:
sqlContext.sql("select distinct CARRIER from flight201601")
It doesn’t produce expected result, instead:
I also tried:
sqlContext.sql("select * from flight201601")
and it gives me:
So it seems the registerTempTable method only create the table schema and the table is NOT populated, what am I missing?
Advertisement
Answer
You will have to call show()
method on Dataframe returned by sqlContext.sql
to get the result of query. Check the spark document which says
The sql function on a SQLContext enables applications to run SQL queries programmatically and returns the result as a DataFrame.
sqlDF = sqlContext.sql("select * from flight201601") sqlDF.show()