I’m using the full text search feature from Postgres and for the most part it works fine.
I have a column in my database table called documentFts
that is basically the ts_vector
version of the body
field, which is a text column, and that’s indexed with GIN index.
Here’s my query:
select count(*) OVER() AS full_count, id, url, (("urlScore" / 100) + ts_rank("documentFts", websearch_to_tsquery($4, $1))) as "finalScore", ts_headline('english_unaccent', title, websearch_to_tsquery($4, $1)) as title, ts_headline('english_unaccent', body, websearch_to_tsquery($4, $1)) as body, "possibleEncoding", "responseYear" from "Entries" where "language" = $3 and "documentFts" @@ websearch_to_tsquery($4, $1) order by (("urlScore" / 100) + ts_rank("documentFts", websearch_to_tsquery($4, $1))) desc limit 20 offset $2;
The dictionary is english_unaccent
because I created one based on english
that uses the unaccent
extension by using:
CREATE TEXT SEARCH CONFIGURATION english_unaccent ( COPY = english ); ALTER TEXT SEARCH CONFIGURATION english_unaccent ALTER MAPPING FOR hword, hword_part, word WITH unaccent, english_stem;
I did the same for other languages.
And then I did this to my Entries db:
ALTER TABLE "Entries" ADD COLUMN "documentFts" tsvector; UPDATE "Entries" SET "documentFts" = (setweight(to_tsvector('english_unaccent', coalesce(title)), 'A') || setweight(to_tsvector('english_unaccent', coalesce(body)), 'C')) WHERE "language" = 'english';
I have a column in my table with the language of the entry, hence the "language" = 'english'
.
So, the problem I’m having is that for words like animal
, anime
or animation
, they all go into the vector as anim
, which means that if I search for any of those words I get results with all of those variations.
That returns a HUGE dataset that causes the query to be quite slow compared to searches that return fewer items. And also, if I search for Anime
, my first results contain Animal
, Animated
and the first result that has the word Anime
is the 12th one.
Shouldn’t animation
be transformed to animat
in the vector and animal
just be animal
as the other variations for it are animals
or animalia
?
I’ve been searching for a solution to this without much luck, is there any way I can improve this, I’m happy to install extensions, reindex the column or whatever.
Advertisement
Answer
There are so many little details to this. The best solution depends on the exact situation and exact requirements.
Two simple options:
Simple tweak 1
If you want to sort rows where title
or body
have a word starting with ‘Anime’ (exactly) in it, matched case-insensitively, add an ORDER BY
expression like:
ORDER BY unaccent(concat_ws(' ', title, body) !~* ('m' || f_regexp_escape($4)) , (("urlScore" / 100) + ts_rank("documentFts", websearch_to_tsquery($4, $1))) DESC
Where the auxiliary function f_regexp_escape()
escapes special regexp characters and is defined here:
That expression is rather expensive, but since it’s only applied to filtered results, the effect is limited. You may have to fine-tune, as other search terms present other difficulties. Think of ‘body’ / ‘bodies’ stemming to ‘bodi’ …
Simple tweak 2
To remove English stemming completely, base yours on the ‘simple’ TEXT SEARCH CONFIGURATION
:
CREATE TEXT SEARCH CONFIGURATION simple_unaccent ( COPY = simple );
Etc.
Then the actual language of the text is irrelevant.The index gets substantially bigger, and the search is done on literal spellings. You can now widen the search with prefix matching like:
WHERE "documentFts" @@ to_tsquery('simple_unaccent', ($1 || ':*')
Again, you’ll have to fine-tune. The simple example only works for single-word patterns. And I doubt you want to get rid of stemming altogether. Probably too radical.
See:
Proper solution: Synonym dictionary
You need access to the installation drive of the Postgres server for this. So typically not possible with most hosted services.
To overrule some of the stemmer decisions, overrule with your own set of synonym(rule)s. Create a mapping file in $SHAREDIR/tsearch_data/my_synonyms.syn
. That’s /usr/share/postgresql/13/tsearch_data/my_synonyms.syn
in my Linux installation:
Let it contain (case insensitive by default):
anime anime
Then:
CREATE TEXT SEARCH DICTIONARY my_synonym ( TEMPLATE = synonym, SYNONYMS = my_synonyms );
There is a chapter with instructions in the manual. One quote:
A synonym dictionary can be used to overcome linguistic problems, for example, to prevent an English stemmer dictionary from reducing the word “Paris” to “pari”. It is enough to have a
Paris paris
line in the synonym dictionary and put it before theenglish_stem
dictionary.
Then:
CREATE TEXT SEARCH CONFIGURATION my_english_unaccent ( COPY = english ); ALTER TEXT SEARCH CONFIGURATION my_english_unaccent ALTER MAPPING FOR hword, hword_part, word WITH unaccent, my_synonym, english_stem; -- added my_synonym!
You have to update your column "documentFts"
with my_english_unaccent
. While being at it, use a proper lower-case column name like document_fts
, and consider a GENERATED
column. See:
- Computed / calculated / virtual / derived columns in PostgreSQL
- Are PostgreSQL column names case-sensitive?
Now, searching for Anime
(or ánime
, for that matter) won’t find animal
any more. And searching for animal
won’t find Anime
.