In a sense I’ve answered my own question, but I’m trying to understand the answer better:
When using regexp_substr (in oracle) to extract the first occurrence of a number (either single or multi digits), how/why do the modifiers * and + impact the results? Why does + provide the behavior I’m looking for and * does not? * is my default usage in most regular expressions so I was surprised it didn’t suit my need.
For example, in the following:
select test, regexp_substr(TEST,'d') Pattern1, regexp_substr(TEST,'d*') Pattern2, regexp_substr(TEST,'d+') Pattern3 from ( select '123 W' TEST from dual union select 'W 123' TEST from dual );
the use of regexp_substr(TEST,'d*') returns null value for the input “W 123” – since ‘zero or more’ digits exist in the string, I’m confused by this behavior. I’m also confused why it does work on the string ‘123 W’
my understanding is that * means zero or more occurrences of the element it follows and + means 1 or more occurrence of the preceding element. In the example provided for pattern2 [d*] why does it successfully capture “123” from “123 W” but it does not take 123 from “W 123” as zero or more occurrences of a digit do exist, they just don’t exist in the beginning of the string. Is there additional [implied] logic attached to using *?
Note: I looked around for a while trying to find similar questions that helped me capture the ‘123’ from ‘W 123’ but the closest i found was variations of regexp_replace which would not meet my needs.
Advertisement
Answer
So the regexp_count indicates there are FOUR substrings that match the d* pattern. The third of those is the ‘123’. The implication is that the first and second are derived from the W and space and what you have is a zero length result that ‘consumes’ one character of the source string.
select test,
regexp_count(TEST,'d*') Pattern2_c,
regexp_substr(TEST,'d*') Pattern2,
regexp_substr(TEST,'d*',1,1) Pattern2_1,
regexp_substr(TEST,'d*',1,2) Pattern2_2,
regexp_substr(TEST,'d*',1,3) Pattern2_3,
regexp_substr(TEST,'d*',1,4) Pattern2_4
from (select '123 W' TEST from dual
union
select 'W 123' TEST from dual
);
Oracle has a weird thing about zero length strings and null.
The result doesn’t “feel” right, but then if you ask a computer deep philosophical questions about how many zero length substrings are contained in a string, I wouldn’t bet on any answer.