Skip to content
Advertisement

Cyrillic symbols in SELECT query PostgreSQL

I have a telegram bot (Aiogram, Python), which is connected to PostgreSQL database.

If user enters a some username (in Cyrillic, Russian or Ukrainian), bot should return full information about this user, stored in database.

The problem is: when bot making query to the database, with username for example “Сергій”, it returns nothing, although information exists in database.

SELECT * 
FROM users 
WHERE username = 'Сергій';

I found out that problem is in symbols like ‘c’ or ‘i’, that looks similar, but have different hex code in UTF-8. If we’ll change query this way all works fine

SELECT * 
FROM users 
WHERE username LIKE '_ерг_й';

I’ve already changed database encoding to UTF-8, but still have this problem. If i copy username directly from database and paste it into bot’s input field – it works.

Here’s main fragments of my code:

@dp.message_handler()
async def username_input_handler(message: Message):
    username = message.text

    answer_text = await db.search_by_username(username)
    await message.answer(text=answer_text)

# database class
class Database():
    ...
    async def search_by_username(username):
        query = "SELECT * FROM users WHERE username LIKE $1"
        user = await self.pool.execute(query, username)
        return user
    ...

UPDATE:

Problem occurs only when I am querying the ‘users’ table. This table’s data was imported from a .txt file.

Running the same queries against another table works fine.

Advertisement

Answer

If the users enter the exact same characters as are in the database, the query will work, but obviously they aren’t. You probably need similarity search:

CREATE EXTENSION pg_trgm;

CREATE INDEX ON users USING gist (username gist_trgm_ops);

SELECT *
FROM users
ORDER BY username <-> 'Сергій'
LIMIT 1;

This will probably not be satisfactory for short names like Сергій. If both С and і are the wrong character, there is only one trigram left to match: ерг. And this may not find the closest match that you are looking for.

You may need to write a function cyrillic_normalize that replaces every non-Cyrillic letter with the “closest” Cyrillic letter, like

CREATE FUNCTION cyrillic_normalize(text) RETURNS text
   LANGUAGE sql AS
$$SELECT translate ($1, 'Ci', 'Сі')$$;

where the first string contains the Latin letters and the second the corresponding Cyrillic ones.

Then change your query to

WHERE cyrillic_normalize(username) = cyrillic_normalize('Сергій')
User contributions licensed under: CC BY-SA
7 People found this is helpful
Advertisement