I’m trying to load a remote Oracle database table into the Apache Spark shell.
This is how I started spark-shell:
./spark-shell --driver-class-path ../jars/ojdbc6.jar --jars ../jars/ojdbc6.jar --master local
At the Scala prompt, I try to load an Oracle database table as shown below (I use a custom JDBC URL):
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=WHATEVER)(HOST=myDummyHost.com)(PORT=xxxx)))(CONNECT_DATA=(SERVICE_NAME=dummy)(INSTANCE_NAME=dummyKaMummy)(UR=A)(SERVER=DEDICATED)))")
  .option("dbtable", "THE_DUMMY_TABLE")
  .option("user", "DUMMY_USER")
  .option("password", "DUMMYPASSWORD")
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .load()
(I have replaced employer data with dummy values.)
And then I get this error:
java.sql.SQLException: Unrecognized SQL type -102
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getCatalystType(JdbcUtils.scala:246)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:316)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:316)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:315)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
  ... 49 elided
I tried to see if there is an issue with the quotes, but it’s not that.
Can somebody save my life, please?
Answer
The problem is an incompatible column type in the database. If you cannot modify the database but would still like to read from it, the solution is to exclude the offending columns (in my case it is a field of type geography). With the help of How to select specific columns through Spark JDBC?, here is a solution in PySpark (the Scala solution would be similar):
df = spark.read.jdbc(url=connectionString, table="(select colName from Table) as CompatibleTable", properties=properties)
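For the spark-shell session shown in the question, a rough Scala equivalent of the same column-limiting trick might look like the sketch below. COL_A and COL_B are hypothetical placeholders for the columns Spark can map; note that Oracle does not accept the AS keyword when aliasing a derived table, so the subquery alias is written without it. The block can be entered with :paste in spark-shell.

// Sketch only: read a column-limited subquery via "dbtable" so the
// incompatible column never reaches Spark's schema resolution.
// COL_A and COL_B are placeholders for the compatible columns.
val compatibleDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=WHATEVER)(HOST=myDummyHost.com)(PORT=xxxx)))(CONNECT_DATA=(SERVICE_NAME=dummy)(INSTANCE_NAME=dummyKaMummy)(UR=A)(SERVER=DEDICATED)))")
  .option("dbtable", "(SELECT COL_A, COL_B FROM THE_DUMMY_TABLE) CompatibleTable")
  .option("user", "DUMMY_USER")
  .option("password", "DUMMYPASSWORD")
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .load()

Everything except the dbtable option is unchanged from the original read; since Spark resolves the schema by querying the given dbtable expression, only the selected columns are examined and the unsupported type never triggers the error.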