Skip to content
Advertisement

Oracle SQL regexp_substr number extraction behavior

In a sense I’ve answered my own question, but I’m trying to understand the answer better:

When using regexp_substr (in oracle) to extract the first occurrence of a number (either single or multi digits), how/why do the modifiers * and + impact the results? Why does + provide the behavior I’m looking for and * does not? * is my default usage in most regular expressions so I was surprised it didn’t suit my need.

For example, in the following:

select test, 
regexp_substr(TEST,'d') Pattern1,
regexp_substr(TEST,'d*') Pattern2, 
regexp_substr(TEST,'d+') Pattern3
from (
select '123 W' TEST from dual
union
select 'W 123' TEST from dual
);

the use of regexp_substr(TEST,'d*') returns null value for the input “W 123” – since ‘zero or more’ digits exist in the string, I’m confused by this behavior. I’m also confused why it does work on the string ‘123 W’

my understanding is that * means zero or more occurrences of the element it follows and + means 1 or more occurrence of the preceding element. In the example provided for pattern2 [d*] why does it successfully capture “123” from “123 W” but it does not take 123 from “W 123” as zero or more occurrences of a digit do exist, they just don’t exist in the beginning of the string. Is there additional [implied] logic attached to using *?

Note: I looked around for a while trying to find similar questions that helped me capture the ‘123’ from ‘W 123’ but the closest i found was variations of regexp_replace which would not meet my needs.

Advertisement

Answer

So the regexp_count indicates there are FOUR substrings that match the d* pattern. The third of those is the ‘123’. The implication is that the first and second are derived from the W and space and what you have is a zero length result that ‘consumes’ one character of the source string.

select test, 
    regexp_count(TEST,'d*') Pattern2_c, 
    regexp_substr(TEST,'d*') Pattern2,
    regexp_substr(TEST,'d*',1,1) Pattern2_1,
    regexp_substr(TEST,'d*',1,2) Pattern2_2,
    regexp_substr(TEST,'d*',1,3) Pattern2_3,
    regexp_substr(TEST,'d*',1,4) Pattern2_4
 from (select '123 W' TEST from dual
    union
    select 'W 123' TEST from dual
    );

Oracle has a weird thing about zero length strings and null.

The result doesn’t “feel” right, but then if you ask a computer deep philosophical questions about how many zero length substrings are contained in a string, I wouldn’t bet on any answer.

User contributions licensed under: CC BY-SA
5 People found this is helpful
Advertisement