On a weekly basis, I run a Python script that migrates data from a PostgreSQL server to a Teradata server. One table at a time, this script:
- DROP/CREATEs the Teradata version of the table,
- pulls the new data from PostgreSQL,
- saves a copy of the table data as a CSV in a network drive (for business reasons),
- adds the data downloaded from PostgreSQL to the identical table in Teradata.
This happens for 28 tables, and it occurred to me that doing this one table at a time is nuts. Conceptually, can I use multithreading to run this process on, say, 5 tables at once, starting the next one as each finishes, so that there are always 5 tables being loaded at any one time until the list of 28 tables is exhausted? I can see this being a serious game changer, but I haven't found any useful info on it.
Answer
You can speed up the process with concurrent read and write operations using psycopg2's ThreadedConnectionPool.
Ref: https://pynative.com/psycopg2-python-postgresql-connection-pooling/
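For reference, bare ThreadedConnectionPool usage looks roughly like this (the DSN and the query are placeholders, not values from your setup):

```python
from psycopg2.pool import ThreadedConnectionPool

# minconn=1, maxconn=5: the pool keeps at least 1 and at most 5 connections
pool = ThreadedConnectionPool(1, 5, dsn="postgresql://user:pass@pg-host/db")

conn = pool.getconn()  # borrow a connection from the pool
try:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
finally:
    pool.putconn(conn)  # always return it, even on error

pool.closeall()
```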
However, ThreadedConnectionPool raises a PoolError as soon as more threads request connections than maxconn allows, rather than making them wait. To resolve that race condition on the shared pool, it is best to wrap it with a semaphore so that threads block until a connection is free:
```python
from threading import Semaphore

from psycopg2.pool import ThreadedConnectionPool


class ReallyThreadedConnectionPool(ThreadedConnectionPool):
    """Pool whose getconn() blocks when all connections are busy,
    instead of raising PoolError."""

    def __init__(self, minconn, maxconn, *args, **kwargs):
        self._semaphore = Semaphore(maxconn)
        super().__init__(minconn, maxconn, *args, **kwargs)

    def getconn(self, *args, **kwargs):
        self._semaphore.acquire()  # wait here until a connection is free
        return super().getconn(*args, **kwargs)

    def putconn(self, *args, **kwargs):
        super().putconn(*args, **kwargs)
        self._semaphore.release()  # wake up one waiting thread
```
Now use ReallyThreadedConnectionPool together with Python's multithreading (e.g. concurrent.futures.ThreadPoolExecutor) to achieve your process.
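As a rough sketch of how this could fit your 28-table job: the table list, DSN, and CSV path below are placeholders, and load_into_teradata is a hypothetical stand-in for your existing DROP/CREATE-and-load logic with whatever Teradata driver you already use.

```python
import csv
from concurrent.futures import ThreadPoolExecutor

# ReallyThreadedConnectionPool as defined above

TABLES = ["table_a", "table_b"]  # ... your 28 table names

# maxconn=5 caps concurrent PostgreSQL reads at 5
pg_pool = ReallyThreadedConnectionPool(1, 5, dsn="postgresql://user:pass@pg-host/db")


def load_into_teradata(table, rows):
    # Placeholder: your existing DROP/CREATE + insert logic for Teradata
    ...


def migrate_table(table):
    conn = pg_pool.getconn()  # blocks if 5 migrations are already running
    try:
        with conn.cursor() as cur:
            cur.execute(f"SELECT * FROM {table}")  # assumes trusted table names
            rows = cur.fetchall()
    finally:
        pg_pool.putconn(conn)

    # Save the CSV copy to the network drive (path is a placeholder)
    with open(rf"\\network\share\{table}.csv", "w", newline="") as f:
        csv.writer(f).writerows(rows)

    load_into_teradata(table, rows)


with ThreadPoolExecutor(max_workers=5) as executor:
    # consuming the iterator re-raises any exception from a worker
    list(executor.map(migrate_table, TABLES))

pg_pool.closeall()
```

Keeping max_workers equal to the pool's maxconn means a worker never queues long on the semaphore; raise or lower both numbers together to tune how many tables are in flight.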