I am looking to implement a cache in MySQL. I’m doing this because the app I’m developing needs a cache but has no persistence.
I am keeping a collection of SHA-1 checksums for files. The user can provide me with a list of checksums, and I need to return to them which checksums I do not have in my database. I’m expecting the number of checksums in my database to be on the order of a million, and the number of values to check to be on the order of a hundred thousand.
The naive approach would be to issue a long string of
SELECT COUNT(*) FROM myTable WHERE checksum = '2fd4e1c67a2d28fced849ee1bb76e7391b93eb12';
This could be optimized somewhat by using the ability to use parameters with the Python connector. However, this seems like a lot of churn across the connection between the python process and the MySQL process.
I know there is an approach along the lines of SELECT checksum FROM myTable NOT IN (SELECT someOtherTable WHERE someClause)
, which is a match for what I need except that the list of checksums to check against is in my Python process, not a MySQL table.
What is the best way to find checksums that are not in a list which is initially only available in my Python process (so this must be shared across to MySQL across IPC)?
Advertisement
Answer
Your table must have an index by checksum field.
When user provides a list of checksums you must create temporary table, store these values into it, build the index by shecksum field and search using sommon simple query with JOIN.
The fastest method to insert the values list to be searched for is to save them into texst file and load into temporary table using LOAD DATA INFILE. Applicable if client have enough rights to create a file in the MySQL’s server filesystem.
If it is impossible then insert the values by chunks.