How does spark SQL access databases

Question

Suppose you access a SQL database with spark SQL. With RDD spark partitions the data into many different parts that all together make the data set. My question is how does Spark SQL manages this access from the N nodes to the database. I can see several possibilities: Each nodes of the RDD access to the database and builds up

Accepted Answer

This is documented quite in detail on the JDBC To Other Databases data sources documentation page.In short, each node involved in the job will establish a connection to the database. However, it isn&#8217;t the node count that determines the number of connections, but the configured number of partitions:  numPartitions:  The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling coalesce(numPartitions) before writing. And regarding  The JDBC package uses the pooled connections in order to avoid connecting again and again. But this does not address this problem.JDBC drivers don&#8217;t implicitly pool connections. The application sets up, configures, and uses the connection pool. Unless Spark needs to connect to the database repeatedly in order to fetch the data, it doesn&#8217;t need to establish multiple connections per partition. So there would be no need to pool connections.The linked documentation page has a list of options that applications can use to control how the data is fetched from the database.

Advertisement

Answer