I have a SQL Server table with a column of type TEXT
that would store candidate resumes in different format. RTF is the most common one but often we get resume data from a 3rd party converter which stores the resume as special characters (maybe Unicode or I don’t know what they are).
How do I search my table to find all the rows that have these special characters? For example the rows with id = 4,6,7, 9 etc. all are the records with special characters.
What format are these special characters called? Unicode??
Advertisement
Answer
Assuming that by “special” characters you mean anything outside the set of printable ASCII and certain common whitespace characters , you can try the following:
DECLARE @SpecialPattern VARCHAR(100) = '%[^' + CHAR(9) + CHAR(10) + CHAR(13) -- tab, CR, LF + CHAR(32) + '-' + CHAR(126) -- Range from space to last printable ASCII + ']%' SELECT RESUME_TEXT, cast(left(cast(resume_text as varchar(max)),20) as varbinary(max))` -- Borrowed from userMT's comment FROM RESUME WHERE RESUME_TEXT LIKE @SpecialPattern COLLATE Latin1_General_Bin -- Use exact compare
You may get some false hits against some perfectly valid extended characters such as accented vowels, curly quotes, or m- and n- dashes that may exist in the text.
My first though is that the weird characters might be a UTF-8 BOM (hex EF, BB, BF), but the display didn’t seem to match the how I would expect SQL Server to render them. The inverse dot isn’t present at all in the default windows code page (1252).
We need at least some hex data (at least the first few bytes) to help further. Often, common binary file types have a recognizable signature in the first 3-5 bytes.