The same SQL command has two different output on Hive and Impala:
select regexp_extract('AbcffdBCdeffffGHI','.*?(f+)',1);
Hive output: ff
Impala output: ffff
Why such difference? Please explain difference in terms of each engine’s method of processing and outputting characters space-by-space, from left to right or right to left, step by step, and the reasoning, logic, and engines’ coding. Of course, talking about difference needs to talk about “convention”, too. What is the convention? Which of these output conforms to convention?
The SQL command: select regexp_extract('AbcffdBCdeffffGHI','.*?(f+)',1);
has been executed on Hive and Impala and output obtained as stated.
These places here have been searched and offer No explanation to the question asked.
Hive, https://cwiki.apache.org/confluence/display/Hive/Home
A regular expression extraction and output has been done on: https://onlinetexttools.com/extract-regex-matches-from-text
Result is same as Hive’s. See image: Online Regular Expression Extraction Result
Advertisement
Answer
Update: post asking the Impala community it turns out this was suggested to be a bug, 6 years ago..
https://issues.apache.org/jira/browse/IMPALA-2917
The suggested workaround is to add a greedy quant to the end of the string to push the .*?
as small as possible, though that would reduce its ability to match multiple times in some cases
Originally I wrote:
I’ve read the Cloudera documentation and it’s just absolute nonsense to me.. The docs say
This example shows how a pattern string starting with .*? matches the shortest possible portion of the source string, returning the rightmost set of lowercase letters.
In my opinion and experience of various Regex engines, the “shortest possible portion” search starts from the left and proceeds rightwards trying to make a match, returning the leftmost matching group). This is in constrast to .*
which consumes all and works backwards from the right resulting in the longest possible match and consequently the rightmost set of characters
ab12de34fg with .*?d+ is .*? matches empty string, d+ fails on all of: ab12de34fg, ab12de34f, ab12de34, ab12de3, ab12de, ab12d, ab12, ab1, ab, a .*? matches a, d+ fails on all of: b12de34fg, b12de34f, b12de34, b12de3, b12de, b12d, b12, b1, b .*? matches ab, d+ fails on all of: 12de34fg, 12de34f, 12de34, 12de3, 12de, 12d then d+ succeeds on 12 ab12 is matched move to match cd34 in the same way --- ab12de34fg with .*d+ is .* matches ab12de34fg, d+ fails on empty string .* matches ab12de34f, d+ fails on g .* matches ab12de34, d+ fails on all of: fg, f .* matches ab12de3, d+ fails on all of: 4fg, 4f then d+ succeeds on 4 ab12cd34 is matched
Cloudera’s doc reads like it finds every match and then returns the one with the lowest char count (but that’s not true for your example, or their next example) so I’m tempted to say that either their .*?
is broken, or I don’t understand how to conceive what their docs says they search on; I wish I had access to an Impala instance to play about with it some more, wrap the .*? in brackets and see what it matches etc..
From the lattermost example in the docs it looks like you can get it to behave like other implementations by putting .*?
on the end of the pattern…
..but I’d be keen to see Cloudera offer a more involved explanation as to why their Regex matching here is unconventional