I discovered a very odd behavior of SQL Server Full Text Search which is indexing SUR, SCR and possibly some other acronyms, together with a number following it – as “Exact Match”.
SELECT * FROM sys.dm_fts_parser ('"SUR 12345"', 1033, 0, 0)
keyword | group_id | phrase_id | occurrence | special_term | display_term | expansion_type | source_term |
---|---|---|---|---|---|---|---|
s u r 1 2 3 4 5 | 1 | 0 | 1 | Exact Match | sur 12345 | 0 | SUR 12345 |
n n 1 2 3 4 5 s u r | 1 | 0 | 1 | Exact Match | nn12345sur | 0 | SUR 12345 |
SELECT * FROM sys.dm_fts_parser ('"SCR 12345"', 1033, 0, 0)
keyword | group_id | phrase_id | occurrence | special_term | display_term | expansion_type | source_term |
---|---|---|---|---|---|---|---|
s c r 1 2 3 4 5 | 1 | 0 | 1 | Exact Match | scr 12345 | 0 | SCR 12345 |
n n 1 2 3 4 5 s c r | 1 | 0 | 1 | Exact Match | nn12345scr | 0 | SCR 12345 |
Other acronyms or texts, including lower case sur, are not affected:
SELECT * FROM sys.dm_fts_parser ('"sur 12345"', 1033, 0, 0)
keyword | group_id | phrase_id | occurrence | special_term | display_term | expansion_type | source_term |
---|---|---|---|---|---|---|---|
s u r | 1 | 0 | 1 | Exact Match | sur | 0 | sur 12345 |
1 2 3 4 5 | 1 | 0 | 2 | Exact Match | 12345 | 0 | sur 12345 |
n n 1 2 3 4 5 | 1 | 0 | 2 | Exact Match | nn12345 | 0 | sur 12345 |
SELECT * FROM sys.dm_fts_parser ('"ABC 12345"', 1033, 0, 0)
keyword | group_id | phrase_id | occurrence | special_term | display_term | expansion_type | source_term |
---|---|---|---|---|---|---|---|
a b c | 1 | 0 | 1 | Exact Match | abc | 0 | ABC 12345 |
1 2 3 4 5 | 1 | 0 | 2 | Exact Match | 12345 | 0 | ABC 12345 |
n n 1 2 3 4 5 | 1 | 0 | 2 | Exact Match | nn12345 | 0 | ABC 12345 |
SELECT * FROM sys.dm_fts_parser ('"XYZ 76"', 1033, 0, 0)
keyword | group_id | phrase_id | occurrence | special_term | display_term | expansion_type | source_term |
---|---|---|---|---|---|---|---|
x y z | 1 | 0 | 1 | Exact Match | xyz | 0 | XYZ 76 |
7 6 | 1 | 0 | 2 | Exact Match | 76 | 0 | XYZ 76 |
n n 7 6 | 1 | 0 | 2 | Exact Match | nn76 | 0 | XYZ 76 |
This behavior seems unexpected, most likely buggy but I also might be missing something obvious related to word-breakers (tried 1033 and 2057 – same effect). I reproduced it on SQL Server 2019 Linux 15.0.4053.23 and 2017 CU20 and CU25, to which I had instant access.
Has anyone had similar issue and a solution to it so that SUR, SCR and any other potentially broken acronym will be indexed independently of the following number?
Edit:
Changing language to 0 (Neutral) results in an odd behavior – it does not solve the problem when SUR acronym is used but fixes the SCR acronym!
SELECT * FROM sys.dm_fts_parser ('"SUR 12345"', 0, 0, 0)
keyword | group_id | phrase_id | occurrence | special_term | display_term | expansion_type | source_term |
---|---|---|---|---|---|---|---|
s u r 1 2 3 4 5 | 1 | 0 | 1 | Exact Match | sur 12345 | 0 | SUR 12345 |
n n 1 2 3 4 5 s u r | 1 | 0 | 1 | Exact Match | nn12345sur | 0 | SUR 12345 |
SELECT * FROM sys.dm_fts_parser ('"SCR 12345"', 0, 0, 0)
keyword | group_id | phrase_id | occurrence | special_term | display_term | expansion_type | source_term |
---|---|---|---|---|---|---|---|
s c r | 1 | 0 | 1 | Exact Match | scr | 0 | SCR 12345 |
1 2 3 4 5 | 1 | 0 | 2 | Exact Match | 12345 | 0 | SCR 12345 |
n n 1 2 3 4 5 | 1 | 0 | 2 | Exact Match | nn12345 | 0 | SCR 12345 |
I decided to give this issue a bounty as ideally I need to solve the problem of search term not found by just re-configuring database indexes.
To help with reproducing the problem below is a script to create a database (with commented out DROP script to help with resetting the state)
/* DROP FULLTEXT INDEX ON EnglishTexts DROP FULLTEXT INDEX ON NeutralTexts DROP FULLTEXT CATALOG TestSearchCatalog USE master DROP DATABASE TestSearch */ CREATE DATABASE TestSearch GO USE [TestSearch] GO CREATE FULLTEXT CATALOG TestSearchCatalog WITH ACCENT_SENSITIVITY = OFF GO CREATE TABLE EnglishTexts (Id INT IDENTITY(1,1) NOT NULL, Text NVARCHAR(MAX), CONSTRAINT PK_EnglishTexts PRIMARY KEY CLUSTERED (Id)) CREATE FULLTEXT INDEX ON EnglishTexts (Text LANGUAGE 'English') KEY INDEX PK_EnglishTexts ON ([TestSearchCatalog]) WITH (CHANGE_TRACKING = AUTO, STOPLIST = OFF) INSERT INTO EnglishTexts(Text) VALUES ('PRFX 12233') INSERT INTO EnglishTexts(Text) VALUES ('SUR 12233') INSERT INTO EnglishTexts(Text) VALUES ('SCR 12233') CREATE TABLE NeutralTexts (Id INT IDENTITY(1,1) NOT NULL, Text NVARCHAR(MAX), CONSTRAINT PK_NeutralTexts PRIMARY KEY CLUSTERED (Id)) CREATE FULLTEXT INDEX ON NeutralTexts (Text LANGUAGE 'Neutral') KEY INDEX PK_NeutralTexts ON ([TestSearchCatalog]) WITH (CHANGE_TRACKING = AUTO, STOPLIST = OFF) INSERT INTO NeutralTexts(Text) VALUES ('PRFX 12233') INSERT INTO NeutralTexts(Text) VALUES ('SUR 12233') INSERT INTO NeutralTexts(Text) VALUES ('SCR 12233') -- following query returns 1 row but should 3 - a possible bug in english word breaker SELECT * FROM EnglishTexts WHERE CONTAINS(Text, '"12233"') -- following query returns 2 rows but should 3 - neutral language word breaker is also treating SUR acronym specially - another bug? SELECT * FROM NeutralTexts WHERE CONTAINS(Text, '"12233"') -- following query returns 1 row but should 3 - forcing neutral language on a query on english index should apply neutral language (i might misunderstand if this is even possible without a neutral index) SELECT * FROM EnglishTexts WHERE CONTAINS(Text, '"12233"', LANGUAGE 0) -- following query returns 2 rows but should 3 - using neutral language on neutral language indexed table should not make a difference SELECT * FROM NeutralTexts WHERE CONTAINS(Text, '"12233"', LANGUAGE 0) -- for reference - English word breaker does not split SCR with 12233 and SUR with 12233, causing above problems SELECT * FROM sys.dm_fts_parser ('"SCR 12233 SUR 12233"', 1033, 0, 0) -- for reference - Neutral word breaker correctly splits SCR and 12233 but not SUR with 12233 SELECT * FROM sys.dm_fts_parser ('"SCR 12233 SUR 12233"', 0, 0, 0)
Advertisement
Answer
Finally I was able to determine that the issue is related to a currency symbol (apparently SUR and SCR are currency symbols) followed or preceded by a number, causes both to be indexed together.
In my opinion this might be a desired behaviour only if user expects past (SUR – Soviet Ruble, not in use since 1993) or current (SCR – Seychelles Rupee) currencies to be present in text and only if the currency symbol follows or precedes the number according to standards (for example $ precedes the number, SCR or € follows the number).
Moreover, currency symbols seem to be partially affecting Neutral language breaker – past currencies like SUR are fine but current currencies affecting language-neutral word breaking is an entirely unexpected behaviour considering language neutral text processing should not be affected by any dictionary words.
Microsoft documentation of SQL Server 2012 and up FTS text processing explains relevant changes to a word breaker, showing that a new word breaker does not index neither currency symbol or a number separately, even in a language neutral word-breaker:
term | previous | new |
---|---|---|
100$ | 100$ | 100$ |
100$ | nn100 | nn100usd |
$100 000 USD | $100 | $100 000 usd |
$100 000 USD | 000 | |
$100 000 USD | nn000 | |
$100 000 USD | nn100$ | |
$100 000 USD | usd |
The only solution to fix the original problem is to revert to a pre-2012 word-breaker and stemmer as described here. The solution involves several steps to change the following registry keys (save as .reg file and open to apply, applies to a default instance on SQL Server 2017 – MSSQL14.MSSQLSERVER
– change it to your instance directory name in C:Program FileMicrosoft SQL Server
):
Windows Registry Editor Version 5.00 [HKEY_LOCAL_MACHINESOFTWAREMicrosoftMicrosoft SQL ServerMSSQL14.MSSQLSERVERMSSearchLanguageenu] "WBreakerClass"="{188D6CC5-CB03-4C01-912E-47D21295D77E}" "StemmerClass"="{EEED4C20-7F1B-11CE-BE57-00AA0051FE20}" [HKEY_LOCAL_MACHINESOFTWAREMicrosoftMicrosoft SQL ServerMSSQL14.MSSQLSERVERMSSearchCLSID{188D6CC5-CB03-4C01-912E-47D21295D77E}] @="langwrbk.dll" [HKEY_LOCAL_MACHINESOFTWAREMicrosoftMicrosoft SQL ServerMSSQL14.MSSQLSERVERMSSearchCLSID{EEED4C20-7F1B-11CE-BE57-00AA0051FE20}] @="infosoft.dll"
After making registry changes, SQL Server needs to be restarted and FULLTEXT INDEX objects recreated (DROP + CREATE FILLTEXT INDEX ON...
) for changes to take effect.
To revert to original word breaker and stemmer, use the following registry key:
Windows Registry Editor Version 5.00 [HKEY_LOCAL_MACHINESOFTWAREMicrosoftMicrosoft SQL ServerMSSQL14.MSSQLSERVERMSSearchLanguageenu] "WBreakerClass"="{9FAED859-0B30-4434-AE65-412E14A16FB8}" "StemmerClass"="{E1E5EF84-C4A6-4E50-8188-99AEF3DE2659}"
There are obviously downsides of using the old version of word breaker but at least currency symbols are indexed separately from a numeric values surrounding them.
I would like to add that I reported this problem to Microsoft Support and it ended up being classified as expected and desired behaviour, with no ability to fix it other than using an old word-breaker.
Inflexibility of SQL Server in handling terms like SUR, which in my domain refers to a Surgery instead of Seychelles Rupee, led me to initiating a migration of our products to PostgreSQL, to be completed in the next 6 months.