Why does SQL Server Full-Text Search index the acronym SCR or SUR together with the number that follows it?

Question

I discovered a very odd behavior of SQL Server Full-Text Search: it indexes SUR, SCR and possibly some other acronyms together with the number that follows them, as a single "Exact Match" term. …

Accepted Answer

I was finally able to determine that the issue is related to currency symbols (apparently SUR and SCR are currency symbols): when one is followed or preceded by a number, the two are indexed together as a single term.

In my opinion this would be desirable behaviour only if the user expects past (SUR, the Soviet Ruble, not in use since 1993) or current (SCR, the Seychelles Rupee) currencies to be present in the text, and only if the currency symbol follows or precedes the number according to convention (for example, $ precedes the number, while SCR or € follows it).

Moreover, currency symbols seem to partially affect the Neutral language word breaker. Past currencies like SUR are fine, but current currencies affecting language-neutral word breaking is entirely unexpected, considering that language-neutral text processing should not be affected by any dictionary words.

Microsoft's documentation of FTS text processing in SQL Server 2012 and up explains the relevant changes to the word breaker. It shows that the new word breaker indexes neither the currency symbol nor the number separately, even with the language-neutral word breaker (see the term / previous / new table further down).

The only way to fix the original problem is to revert to the pre-2012 word breaker and stemmer, as described here.

The solution involves changing the following registry keys. Save the text as a .reg file and open it to apply. It targets a default instance of SQL Server 2017 (MSSQL14.MSSQLSERVER); change that to your instance directory name under C:\Program Files\Microsoft SQL Server:

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL14.MSSQLSERVER\MSSearch\Language\enu]
"WBreakerClass"="{188D6CC5-CB03-4C01-912E-47D21295D77E}"
"StemmerClass"="{EEED4C20-7F1B-11CE-BE57-00AA0051FE20}"

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL14.MSSQLSERVER\MSSearch\CLSID\{188D6CC5-CB03-4C01-912E-47D21295D77E}]
@="langwrbk.dll"

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL14.MSSQLSERVER\MSSearch\CLSID\{EEED4C20-7F1B-11CE-BE57-00AA0051FE20}]
@="infosoft.dll"

After making the registry changes, restart SQL Server and recreate the FULLTEXT INDEX objects (DROP + CREATE FULLTEXT INDEX ON ...) for the changes to take effect.

To revert to the original word breaker and stemmer, use the following registry key:

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL14.MSSQLSERVER\MSSearch\Language\enu]
"WBreakerClass"="{9FAED859-0B30-4434-AE65-412E14A16FB8}"
"StemmerClass"="{E1E5EF84-C4A6-4E50-8188-99AEF3DE2659}"

There are obviously downsides to using the old version of the word breaker, but at least currency symbols are indexed separately from the numeric values surrounding them.

I would like to add that I reported this problem to Microsoft Support, and it ended up being classified as expected and desired behaviour, with no way to fix it other than using the old word breaker.

The inflexibility of SQL Server in handling terms like SUR, which in my domain refers to surgery rather than the Soviet Ruble, led me to start migrating our products to PostgreSQL, to be completed in the next 6 months.
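
The restart-and-rebuild step mentioned in the answer can be sketched in T-SQL roughly as follows. This is only an illustration under stated assumptions: the table dbo.Documents, the column Body, the key index PK_Documents and the catalog DocumentsCatalog are hypothetical names, and LCID 1033 (English) is assumed to be the language of the indexed column. The first statement is an optional check of which word breaker is registered after the restart.

```sql
-- Hypothetical object names throughout: dbo.Documents (table), Body (column),
-- PK_Documents (unique key index), DocumentsCatalog (full-text catalog).

-- 1. After restarting SQL Server, list the word breaker registered for
--    English (LCID 1033) to see which component is loaded.
EXEC sp_help_fulltext_system_components 'wordbreaker', 1033;

-- 2. Drop the existing full-text index so the content is re-parsed
--    from scratch with the word breaker now in effect.
DROP FULLTEXT INDEX ON dbo.Documents;

-- 3. Recreate the full-text index; the population that follows
--    re-tokenizes every row.
CREATE FULLTEXT INDEX ON dbo.Documents
(
    Body LANGUAGE 1033          -- assumed: English word breaker
)
KEY INDEX PK_Documents          -- must be a unique, single-column, non-nullable index
ON DocumentsCatalog
WITH CHANGE_TRACKING AUTO;
```

Dropping and recreating the index, rather than only repopulating it, follows the answer's own wording (DROP + CREATE); whether a full repopulation alone would suffice is not something the answer states.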

keyword	group_id	phrase_id	occurrence	special_term	display_term	expansion_type	source_term
s u r 1 2 3 4 5	1	0	1	Exact Match	sur 12345	0	SUR 12345
n n 1 2 3 4 5 s u r	1	0	1	Exact Match	nn12345sur	0	SUR 12345

keyword	group_id	phrase_id	occurrence	special_term	display_term	expansion_type	source_term
s c r 1 2 3 4 5	1	0	1	Exact Match	scr 12345	0	SCR 12345
n n 1 2 3 4 5 s c r	1	0	1	Exact Match	nn12345scr	0	SCR 12345

keyword	group_id	occurrence	special_term	display_term	source_term
s u r	1	1	Exact Match	sur	sur 12345
1 2 3 4 5	1	2	Exact Match	12345	sur 12345
n n 1 2 3 4 5	1	2	Exact Match	nn12345	sur 12345

keyword	group_id	occurrence	special_term	display_term	source_term
a b c	1	1	Exact Match	abc	ABC 12345
1 2 3 4 5	1	2	Exact Match	12345	ABC 12345
n n 1 2 3 4 5	1	2	Exact Match	nn12345	ABC 12345

keyword	group_id	occurrence	special_term	display_term	source_term
x y z	1	1	Exact Match	xyz	XYZ 76
7 6	1	2	Exact Match	76	XYZ 76
n n 7 6	1	2	Exact Match	nn76	XYZ 76

keyword	group_id	occurrence	special_term	display_term	source_term
s c r	1	1	Exact Match	scr	SCR 12345
1 2 3 4 5	1	2	Exact Match	12345	SCR 12345
n n 1 2 3 4 5	1	2	Exact Match	nn12345	SCR 12345
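
The column set in the tables above (keyword, group_id, phrase_id, occurrence, special_term, display_term, expansion_type, source_term) matches what the sys.dm_fts_parser dynamic management function returns, so output of this shape can presumably be reproduced with queries along the following lines. This is a minimal sketch, not the asker's actual query: the LCIDs (1033 for English, 0 for Neutral), the system stoplist (0) and accent-insensitive parsing (0) are assumptions, and calling the function requires sysadmin membership.

```sql
-- sys.dm_fts_parser(query_string, lcid, stoplist_id, accent_sensitivity)
-- Assumed settings: stoplist_id 0 = system stoplist, accent_sensitivity 0 = insensitive.

-- Phrase with a suspected currency code, parsed by the English word breaker (LCID 1033).
SELECT keyword, group_id, phrase_id, occurrence,
       special_term, display_term, expansion_type, source_term
FROM sys.dm_fts_parser(N'"SUR 12345"', 1033, 0, 0);

-- The same phrase through the Neutral word breaker (LCID 0), for comparison.
SELECT keyword, group_id, phrase_id, occurrence,
       special_term, display_term, expansion_type, source_term
FROM sys.dm_fts_parser(N'"SUR 12345"', 0, 0, 0);

-- Control phrase whose letters are not expected to be treated as a currency code.
SELECT keyword, group_id, phrase_id, occurrence,
       special_term, display_term, expansion_type, source_term
FROM sys.dm_fts_parser(N'"ABC 12345"', 1033, 0, 0);
```

Comparing the display_term values across the three result sets shows whether the acronym and the number come back as one term or as separate terms under each word breaker.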

term	previous	new
100$	100$	100$
100$	nn100	nn100usd
$100 000 USD	$100	$100 000 usd
$100 000 USD	000
$100 000 USD	nn000
$100 000 USD	nn100$
$100 000 USD	usd
