Hello, I just installed PySpark, and I have a local Postgres database (which I browse in DBeaver). How can I connect to Postgres from PySpark, please?
I tried this:
from pyspark.sql import DataFrameReader

url = 'postgresql://localhost:5432/coucou'
properties = {'user': 'postgres', 'password': 'admin'}
df = DataFrameReader(sqlContext).jdbc(
    url='jdbc:%s' % url,
    table='tw_db',
    properties=properties
)
but I get this error:
File "C:\Spark\spark-3.1.2-bin-hadoop3.2\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o37.jdbc.
: java.lang.ClassNotFoundException: C:/Users/Desktop/postgresql-42.2.23.jre7.jar
Answer
You need to add the jars you want to use when creating the SparkSession.
See this: https://spark.apache.org/docs/2.4.7/submitting-applications.html#advanced-dependency-management
Either when you start pyspark
pyspark --repositories MAVEN_REPO
# OR
pyspark --jars PATH_TO_JAR
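For this question's setup, the launch command could look like the following sketch. The jar path is the one that appears in the question's error message (adjust it to wherever your driver jar actually lives), and the Maven coordinates are those of the stock PostgreSQL JDBC driver:

```shell
# Ship the local PostgreSQL JDBC driver jar with the shell
# (path taken from the question's error message -- change it to your own):
pyspark --jars C:\Users\Desktop\postgresql-42.2.23.jre7.jar

# Or let Spark fetch the driver from Maven Central instead of using a local jar:
pyspark --packages org.postgresql:postgresql:42.2.23
```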
or when you create your sparkSession objects
SparkSession.builder \
    .master("yarn") \
    .appName(app_name) \
    .config("spark.jars.packages", "MAVEN_PACKAGE")
# OR
SparkSession.builder \
    .master("yarn") \
    .appName(app_name) \
    .config("spark.jars", "PATH_TO_JAR")
You need Maven packages when you do not have the jar locally, or when your jar needs some extra dependencies.
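Putting it together for the question's setup, a minimal sketch could look like this. The database name (coucou), table (tw_db), credentials, and jar path are taken from the question; "org.postgresql.Driver" is the standard class name of the PostgreSQL JDBC driver:

```python
# Connection details taken from the question; adjust to your own setup.
JDBC_URL = "jdbc:postgresql://localhost:5432/coucou"  # note the mandatory "jdbc:" prefix
PROPS = {
    "user": "postgres",
    "password": "admin",
    "driver": "org.postgresql.Driver",  # standard PostgreSQL JDBC driver class
}

def read_tw_db():
    """Build a SparkSession that ships the driver jar, then read the table."""
    from pyspark.sql import SparkSession  # imported here so the module loads without pyspark

    spark = (
        SparkSession.builder
        .master("local[*]")  # local mode; use "yarn" on a cluster
        .appName("postgres-demo")
        # Path from the question's error message -- point it at your actual jar:
        .config("spark.jars", r"C:\Users\Desktop\postgresql-42.2.23.jre7.jar")
        .getOrCreate()
    )
    # spark.read.jdbc is the modern entry point; it replaces the
    # DataFrameReader(sqlContext) construction from the question.
    return spark.read.jdbc(url=JDBC_URL, table="tw_db", properties=PROPS)
```

Note that the URL passed to jdbc() must start with `jdbc:`, and the driver class must be resolvable from the jar you configured, otherwise you get exactly the ClassNotFoundException shown in the question.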