Skip to content
Advertisement

As a result of the Euclidean distance many images are mistakenly identified as identical

Sorry for my bad english, i have a small database that contains hashes of photos, when I try to find similar photos to the one below:

for which the following hash was calculated: “0f3f2764ecc482c2” using the method average_hash()

imagehash.average_hash(Image.open(imagePath))

The system finds a very large number of collisions, below is an example of photos that were identified as completely identical:

The table in which I store hashes of photos:

CREATE TABLE IF NOT EXISTS photos(id_photo BIGINT PRIMARY  KEY,  photo_hash TEXT,  FOREIGN KEY (id_photo) REFERENCES users (id));

Adding hash:

cur.execute('''INSERT INTO photos(id_photo, photo_hash) VALUES(%s, %s)''', id, str(result_image_recognition['photo_hash'])])

SQL Query by which I calculate the Euclidean distance between my hash and stored hashes:

SELECT id_photo, BIT_COUNT(photo_hash ^ '0f3f2764ecc482c2') FROM photos;

The photos table contains 7889 photos, 959 of them are mistakenly determined by this query as completely identical (Euclidean distance is 0). About a week I can not solve this problem please someone help me.

Advertisement

Answer

You need to convert the hex strings to integers before doing xor operations

SELECT id_photo, BIT_COUNT(CONV(photo_hash, 16, 10) ^ CONV('0f3f2764ecc482c2', 16, 10)) FROM photos;

because all strings the first character of which is not 1-9 are converted to 0.

mysql> select 'abc' ^ 'def'; # -> 0
mysql> select CONV('abc', 16, 10) ^ CONV('def', 16, 10);  # -> 1875
User contributions licensed under: CC BY-SA
7 People found this is helpful
Advertisement