
Is there a way to connect to PostgreSQL (DBeaver) from PySpark?

Hello, I just installed PySpark, and I have a local PostgreSQL database that I manage in DBeaver. How can I connect to Postgres from PySpark, please?

I tried this:

from pyspark.sql import DataFrameReader

url = 'postgresql://localhost:5432/coucou'
properties = {'user': 'postgres', 'password': 'admin'}
df = DataFrameReader(sqlContext).jdbc(
    url='jdbc:%s' % url, table='tw_db', properties=properties
)

but I get this error:

  File "C:\Spark\spark-3.1.2-bin-hadoop3.2\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o37.jdbc.
: java.lang.ClassNotFoundException: C:/Users/Desktop/postgresql-42.2.23.jre7.jar

Answer

You need to add the jars you want to use when creating the SparkSession.

See the Spark documentation on dependency management: https://spark.apache.org/docs/2.4.7/submitting-applications.html#advanced-dependency-management

Either when you start pyspark:

pyspark --packages MAVEN_PACKAGE   # e.g. org.postgresql:postgresql:42.2.23
# OR
pyspark --jars PATH_TO_JAR         # e.g. C:/Users/Desktop/postgresql-42.2.23.jre7.jar

or when you create your SparkSession object:

SparkSession.builder.master("yarn").appName(app_name).config("spark.jars.packages", "MAVEN_PACKAGE").getOrCreate()
# OR
SparkSession.builder.master("yarn").appName(app_name).config("spark.jars", "PATH_TO_JAR").getOrCreate()

Use Maven packages (spark.jars.packages) when you do not have the jar locally, or when your jar has transitive dependencies that also need to be downloaded.
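Putting the pieces together, here is a minimal sketch using the database name (coucou), table (tw_db), and credentials from the question. The Maven coordinate org.postgresql:postgresql:42.2.23 corresponds to the driver jar named in the error message; the helper function names are my own. Note that, unlike the question's code, the URL must start with the jdbc: prefix.

```python
def jdbc_url(host, port, database):
    """Build the JDBC URL that Spark's JDBC data source expects."""
    return f"jdbc:postgresql://{host}:{port}/{database}"


def read_postgres_table(table):
    """Create a SparkSession with the Postgres driver and read one table."""
    # Imported inside the function so jdbc_url stays usable on its own.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("postgres-example")
        # Spark downloads the driver (and its dependencies) from Maven.
        .config("spark.jars.packages", "org.postgresql:postgresql:42.2.23")
        .getOrCreate()
    )
    return spark.read.jdbc(
        url=jdbc_url("localhost", 5432, "coucou"),
        table=table,
        properties={
            "user": "postgres",
            "password": "admin",
            "driver": "org.postgresql.Driver",
        },
    )


# Usage (requires a running Postgres instance on localhost:5432):
# df = read_postgres_table("tw_db")
# df.show()
```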