Skip to content
Advertisement

Oracle SQL regexp_substr number extraction behavior

In a sense I’ve answered my own question, but I’m trying to understand the answer better:

When using regexp_substr (in oracle) to extract the first occurrence of a number (either single or multi digits), how/why do the modifiers * and + impact the results? Why does + provide the behavior I’m looking for and * does not? * is my default usage in most regular expressions so I was surprised it didn’t suit my need.

For example, in the following:

the use of regexp_substr(TEST,'d*') returns null value for the input “W 123” – since ‘zero or more’ digits exist in the string, I’m confused by this behavior. I’m also confused why it does work on the string ‘123 W’

my understanding is that * means zero or more occurrences of the element it follows and + means 1 or more occurrence of the preceding element. In the example provided for pattern2 [d*] why does it successfully capture “123” from “123 W” but it does not take 123 from “W 123” as zero or more occurrences of a digit do exist, they just don’t exist in the beginning of the string. Is there additional [implied] logic attached to using *?

Note: I looked around for a while trying to find similar questions that helped me capture the ‘123’ from ‘W 123’ but the closest i found was variations of regexp_replace which would not meet my needs.

Advertisement

Answer

So the regexp_count indicates there are FOUR substrings that match the d* pattern. The third of those is the ‘123’. The implication is that the first and second are derived from the W and space and what you have is a zero length result that ‘consumes’ one character of the source string.

Oracle has a weird thing about zero length strings and null.

The result doesn’t “feel” right, but then if you ask a computer deep philosophical questions about how many zero length substrings are contained in a string, I wouldn’t bet on any answer.

User contributions licensed under: CC BY-SA
5 People found this is helpful
Advertisement