I’ve been using this site for longer than I can remember, but I’ve never had to ask a question before – I’ve always been able to find what I need – but this one could use some SQL expertise for the best approach since it’s kind of a worst case.
I have a text comment field that gets a lot of stuff put in it in no consistent format – it’ll include a patient name, hopefully a record ID (which could come from 2 different tables differentiated by the range), spaces, slashes, phone numbers, the initials of whoever entered it, etc. I’ll scrub some examples… I need to find the numeric RecordID in here (it references another disconnected system) and ignore the rest. There’s a remote possibility it could have 2 record numbers (could be referenced in two different tables) – if so, I need to pick the more important one.
Example Data (scrubbed):
Sanchez, Raul – POS – OS – 489-849-7894 – AB – Conf >> Nothing of value here/return NULL
Smith, Nancy RL 1458968-rl >> Return 1458968
Sudhu, Mandeep intake # 78596/rl >> Return 78596
1887569 Smith, Jonathan-ESK >> Return 1021015
Montana, Joe/1451252rm >> Return 1451252
Johnson, Fredy/1784569/78563/RG >> Return 1784569
My first thought was to do a bunch of nested Replace() functions to turn anything that’s not a character into a standard delimiter, parse these all into separate records with split_string(), then only pick the largest number – and that would work in every instance except #5 above… but is that the best way, or is there something better? SQL Version is 2017 for reference to available functions.
This will be wrapped into a scalar valued function.
Thanks!
Advertisement
Answer
Here is a way to extract all numbers.
t1 identify all non-digit characters in each text.
t2 replaces all non-digit characters with spaces.
t3 assumes that numbers might be separated by an unknown number of spaces, and replaces every sequence of spaces with a single space.
Enjoy 🙂
with t1 as ( select record_id ,txt ,trim(translate(txt, '1234567890', space(10))) as non_digits from t ), t2 as ( select record_id ,trim(translate(txt, non_digits, space(len(non_digits)))) as spaced_numberes from t1 ), t3 as ( select record_id ,replace(replace(replace(spaced_numberes, ' ', ' *'), '* ', ''), '*', '') as space_delimited_numbers from t2 ) select t3.record_id ,ss.value from t3 cross apply string_split(space_delimited_numbers, ' ') as ss
record_id | value |
---|---|
1 | 489 |
1 | 849 |
1 | 7894 |
2 | 1458968 |
3 | 78596 |
4 | 1887569 |
5 | 1451252 |
6 | 1784569 |
6 | 78563 |