
Multithreading to load data to SQL tables?

On a weekly basis, I run a Python script that migrates data from a PostgreSQL server to a Teradata server. One table at a time, this script:

  1. DROP/CREATEs the Teradata version of the table,
  2. pulls the new data from PostgreSQL,
  3. saves a copy of the table data as a CSV in a network drive (for business reasons),
  4. adds the data downloaded from PostgreSQL to the identical table in Teradata.

This happens for 28 tables and it occurred to me that doing this one table at a time is nuts. Conceptually, can I use multithreading to run this process on, say, 5 tables at once and when one is finished continue so that there are always 5 tables being loaded at a single time until the list of tables (28 in total) is exhausted? I can see this as a serious game changer but haven’t found any useful info on it.


Answer

You can speed up the process by running the read and write operations concurrently, using

psycopg2’s ThreadedConnectionPool

Ref: https://pynative.com/psycopg2-python-postgresql-connection-pooling/

One catch: when every connection in the pool is already checked out, `ThreadedConnectionPool.getconn` raises a `PoolError` instead of waiting. To make threads block until a connection is free, wrap the pool with a semaphore:

from psycopg2.pool import ThreadedConnectionPool
from threading import Semaphore

class ReallyThreadedConnectionPool(ThreadedConnectionPool):
    """A ThreadedConnectionPool whose getconn() blocks when the pool
    is exhausted, instead of raising PoolError."""

    def __init__(self, minconn, maxconn, *args, **kwargs):
        # One semaphore slot per connection the pool can hand out.
        self._semaphore = Semaphore(maxconn)
        super().__init__(minconn, maxconn, *args, **kwargs)

    def getconn(self, *args, **kwargs):
        # Wait for a free slot, then check out a connection.
        self._semaphore.acquire()
        return super().getconn(*args, **kwargs)

    def putconn(self, *args, **kwargs):
        # Return the connection, then free the slot for waiting threads.
        super().putconn(*args, **kwargs)
        self._semaphore.release()

Now combine this `ReallyThreadedConnectionPool` with Python’s multithreading (for example, `concurrent.futures.ThreadPoolExecutor`) to run several table migrations at once.

User contributions licensed under: CC BY-SA