I’m looking to maintain a (Postgres) SQL database collecting data from third parties. As most data is static, while I get a full dump every day, I want to store only the data that is new. I.e., every day I get 100K new records with say 300 columns, and 95K rows will be the same. In order to do so in an efficient way, I was thinking of inserting a hash of my record (coming from a Pandas dataframe or a Python dict) alongside the data. Some other data is stored as well, like when it was loaded into the database. Then I could, prior to inserting data in the database, hash the incoming data and verify the record is not yet in the database easily, instead of having to check all 300 columns.
My question: which hash function to pick (given that I’m in Python and prefer to use a very fast & solid solution that requires little coding from my side while being able to handle all kinds of data like ints, floats, strings, datetimes, etc)
-
Python’s
hash
is unsuited as it changes for every session (like: Create hash value for each row of data with selected columns in dataframe in python pandas does) -
md5
orsha1
are cryptographic hashes. I don’t need the crypto part, as this is not for security. Might be a bit slow as well, and I had some troubles with strings as these require encoding. - is a solution like
CRC
good enough?
For two and three, if you recommend, how can I implement it for arbitrary dicts and pandas rows? I have had little success in keeping this simple. For instance, for strings I needed to explicitly define the encoding, and the order of the fields in the record should also not change the hash.
Edit: I just realized that it might be tricky to depend on Python for this, if I change programming language I might end up with different hashes. Tying it to the database seems the more sensible choice.
Advertisement
Answer
Have you tried pandas.util.hash_pandas_object
?
Not sure how efficient this is, but maybe you could use it like this:
df.apply(lambda row: pd.util.hash_pandas_object(row), axis=1)
This will at least get you a pandas Series of hashes for each row in the df.