I’ve inherited a database which contains .png images which have been converted byte[]
and finally ToBase64String
and saved in a table with many duplicate images. I want to clean this up create a new table with unique images but the original tables have no primary key related to the image. I’ve come up with the following idea but would like some feedback on viability or better suggestions.
i) Create new table with 2 columns
PrimaryKey varchar(64) ImageBinary varbinary(MAX)
ii) Convert Base64String back to image byte[] and store in SQL as ImageBinary varbinary(MAX)
iii) Hash image using HASHBYTES('SHA2_256', ImageBinary)
and store in SQL as PrimaryKey varchar(64)
iv) Any new images will be hashed and compared against existing primary key hashes.
Current stats approx
- Images in db approx 200,000
- Unique images in db 70,000
- Potential future unique images in db 1,000,000
- Current data character length of ToBase64String in SQL Min:2,900 Max: 170,000
Two main questions
The reason for hashing the image was I thought it may speed up checking for duplicates by comparing new image hash vs existing image hash. An alternative could be just compare the full image and use integer for primary key. Is it better to hash?
If hashing is preferred, I don’t really need cryptographic hashing, what’s the fastest/shortest hashing that can be used in SQL for this?
Advertisement
Answer
- I would store an integer uid – if you need to join, comparing 4 bytes is way faster than 16 Bytes or more. You can still store a hash in addition, if you need it, and you can also put a unique constaint on the hash if the hash isn’t the primary key.
- I’d say MD5. SHA-1 might be slightly faster, but it uses more space (20 bytes instead of 16 PER HASH).
Finally, I wouldn’t store 1 million files in the database.
Have a look at filestream, that way you can access the image via the database, but it’s stored outside the database. Otherwise, making database backups will become a problem.
Have a look at the filestream tutorial from Red-Gate:
https://www.red-gate.com/simple-talk/sql/learn-sql-server/an-introduction-to-sql-server-filestream/
Example for MD5:
SET IMG_MD5 = LOWER(SUBSTRING(master.dbo.fn_varbintohexstr(HashBytes('MD5', FIELD_NAME)), 3, 32) ) ,IMG_SHA1 = sys.fn_varbintohexsubstring(0, HashBytes('SHA1', FIELD_NAME), 1, 0)