Suppose you access a SQL database with spark SQL. With RDD spark partitions the data into many different parts that all together make the data set.
My question is how does Spark SQL manages this access from the N nodes to the database. I can see several possibilities:
Each nodes of the RDD access to the database and builds up their parts. Advantage of it is that the nodes are not forced to allocate a lot of memory, but the database will have to stand N connections with N potentially very large.
A single node access the data and sends the data to the other N-1 nodes as required. The problem is that this single node will need to have all the data and this is unworkable in many cases. Possibly this can be alleviated by getting the data by chunks.
The JDBC package uses the pooled connections in order to avoid connecting again and again. But this does not address this problem.
What would be a reference explaining how spark manages this access to SQL database? How much of it can be parametrized?
Advertisement
Answer
This is documented quite in detail on the JDBC To Other Databases data sources documentation page.
In short, each node involved in the job will establish a connection to the database. However, it isn’t the node count that determines the number of connections, but the configured number of partitions:
numPartitions
: The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling coalesce(numPartitions) before writing.
And regarding
The JDBC package uses the pooled connections in order to avoid connecting again and again. But this does not address this problem.
JDBC drivers don’t implicitly pool connections. The application sets up, configures, and uses the connection pool. Unless Spark needs to connect to the database repeatedly in order to fetch the data, it doesn’t need to establish multiple connections per partition. So there would be no need to pool connections.
The linked documentation page has a list of options that applications can use to control how the data is fetched from the database.