I am trying to retrieve information from a database using a Python tuple containing a set of ids (between 1000 and 10000 ids), but my query uses the IN statement and is subsequently very slow.
query = """ SELECT * 
            FROM table1
            LEFT JOIN table2 ON table1.id = table2.id
            LEFT JOIN ..
            LEFT JOIN ...
            WHERE table1.id IN {} """.format(my_tuple)
and then I query the database using PostgreSQL to charge the result in a Pandas dataframe:
with tempfile.TemporaryFile() as tmpfile:
            copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format(
               query=query, head="HEADER"
            )
            conn = db_engine.raw_connection()
            cur = conn.cursor()
            cur.copy_expert(copy_sql, tmpfile)
            tmpfile.seek(0)
            df = pd.read_csv(tmpfile, low_memory=False)
I know that IN is not very efficient with a high number of parameters, but I do not have any idea to optimise this part of the query. Any hint?
Advertisement
Answer
You could debug your query using explain statement. Probably you are trying to sequently read big table while needing only a few rows. Is field table1.id indexed?
Or you could try to filter table1 first and then start joining
with t1 as (
 select f1,f2, .... from table1 where id in {}
)
select *
from t1
 left join ....