Full text search returning too many irrelevant results and causing poor performance

Question

I’m using the full text search feature from Postgres and for the most part it works fine. I have a column in my database table called documentFts that is basically the tsvector version of the body field, which is a text column, and that’s indexed with a GIN index. Here’s my query: The diction…

Accepted Answer

There are many small details to this. The best solution depends on the exact situation and exact requirements. Two simple options:

Simple tweak 1

If you want to sort first the rows where title or body contain a word starting with ‘Anime’ (exactly), matched case-insensitively, add an ORDER BY expression like:

    ORDER  BY unaccent(concat_ws(' ', title, body)) !~* ('\m' || f_regexp_escape($1))
            , (("urlScore" / 100) + ts_rank("documentFts", websearch_to_tsquery($4, $1))) DESC

Here $1 is the search term and $4 the text search configuration, as in the websearch_to_tsquery() call. \m matches only at the beginning of a word, and false sorts before true, so rows with a match come first. The auxiliary function f_regexp_escape() escapes special regexp characters and is defined here:

Escape function for regular expression or LIKE patterns

That expression is rather expensive, but since it’s only applied to filtered results, the effect is limited.

You may have to fine-tune, as other search terms present other difficulties. Think of ‘body’ / ‘bodies’ stemming to ‘bodi’ …

Simple tweak 2

To remove English stemming completely, base your configuration on the ‘simple’ TEXT SEARCH CONFIGURATION:

    CREATE TEXT SEARCH CONFIGURATION simple_unaccent (
      COPY = simple
    );

Etc.

Then the actual language of the text is irrelevant. The index gets substantially bigger, and the search is done on literal spellings. You can now widen the search with prefix matching like:

    WHERE  "documentFts" @@ to_tsquery('simple_unaccent', ($1 || ':*'))

Again, you’ll have to fine-tune. The simple example only works for single-word patterns. And I doubt you want to get rid of stemming altogether. Probably too radical.

See:

Get partial match from GIN indexed TSVECTOR column

Proper solution: Synonym dictionary

You need access to the installation drive of the Postgres server for this, so it is typically not possible with most hosted services.

To overrule some of the stemmer decisions, supply your own set of synonym rules. Create a mapping file in $SHAREDIR/tsearch_data/my_synonyms.syn. That’s /usr/share/postgresql/13/tsearch_data/my_synonyms.syn in my Linux installation.

Let it contain (case insensitive by default):

    anime anime

Then:

    CREATE TEXT SEARCH DICTIONARY my_synonym (
        TEMPLATE = synonym,
        SYNONYMS = my_synonyms
    );

There is a chapter with instructions in the manual. One quote:

"A synonym dictionary can be used to overcome linguistic problems, for example, to prevent an English stemmer dictionary from reducing the word “Paris” to “pari”. It is enough to have a Paris paris line in the synonym dictionary and put it before the english_stem dictionary."

Then:

    CREATE TEXT SEARCH CONFIGURATION my_english_unaccent (
      COPY = english
    );

    ALTER TEXT SEARCH CONFIGURATION my_english_unaccent
      ALTER MAPPING FOR hword, hword_part, word
      WITH unaccent, my_synonym, english_stem;   -- added my_synonym!

You have to update your column "documentFts" with my_english_unaccent. While at it, use a proper lower-case column name like document_fts, and consider a GENERATED column. See:

Computed / calculated / virtual / derived columns in PostgreSQL
Are PostgreSQL column names case-sensitive?

Now, searching for Anime (or ánime, for that matter) won’t find animal any more. And searching for animal won’t find Anime.
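
The last step (rebuilding the column with my_english_unaccent, ideally as a GENERATED column) could be sketched roughly as follows. This is only an illustration under assumptions: the table name documents is hypothetical, body is the text column mentioned in the question, and the my_english_unaccent configuration is assumed to exist already. Generated columns need PostgreSQL 12 or later.

```sql
-- Sketch only. "documents" is a hypothetical table name; body is the text
-- column from the question; my_english_unaccent is assumed to exist already.

-- Inspect how the new configuration treats the two words before rebuilding anything.
SELECT to_tsvector('my_english_unaccent', 'Anime')  AS anime_vector,
       to_tsvector('my_english_unaccent', 'animal') AS animal_vector;

-- Per the answer, these two should no longer match once my_synonym is mapped.
SELECT to_tsvector('my_english_unaccent', 'Anime')
       @@ to_tsquery('my_english_unaccent', 'animal') AS anime_matches_animal;

-- Add a generated tsvector column with a lower-case name (PostgreSQL 12+).
-- The two-argument form of to_tsvector() is IMMUTABLE, which generated columns require.
ALTER TABLE documents
  ADD COLUMN document_fts tsvector
  GENERATED ALWAYS AS (to_tsvector('my_english_unaccent', coalesce(body, ''))) STORED;

-- Index the new column so @@ searches can use GIN.
CREATE INDEX documents_document_fts_idx ON documents USING gin (document_fts);
```

Queries then have to build the tsquery with the same configuration, for example websearch_to_tsquery('my_english_unaccent', $1), or the lexemes on both sides will not line up. Note that a stored generated column is not recomputed when the synonym file changes later; the rows have to be rewritten for new rules to take effect.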
