Same SQL regexp_extract, different Impala and Hive output. Why?

Question

The same SQL command has two different outputs on Hive and Impala. Hive output: ff. Impala output: ffff. Why such a difference? Please explain the difference in terms of each engine's method of processing and outputting characters space-by-space, from left to right or right to left, step by step, and the reasoni…

Accepted Answer

Update: after asking the Impala community, it turns out this was suggested to be a bug 6 years ago: https://issues.apache.org/jira/browse/IMPALA-2917

The suggested workaround is to add a greedy quantifier to the end of the string to push the .*? to be as small as possible, though that would reduce its ability to match multiple times in some cases.

Originally I wrote:

I've read the Cloudera documentation and it's just absolute nonsense to me. The docs say:

"This example shows how a pattern string starting with .*? matches the shortest possible portion of the source string, returning the rightmost set of lowercase letters."

In my opinion and experience of various regex engines, the "shortest possible portion" search starts from the left and proceeds rightwards trying to make a match, returning the leftmost matching group. This is in contrast to .*, which consumes everything and works backwards from the right, resulting in the longest possible match and consequently the rightmost set of characters.

ab12de34fg with .*?\d+ is:

.*? matches empty string, \d+ fails on all of: ab12de34fg, ab12de34f, ab12de34, ab12de3, ab12de, ab12d, ab12, ab1, ab, a
.*? matches a, \d+ fails on all of: b12de34fg, b12de34f, b12de34, b12de3, b12de, b12d, b12, b1, b
.*? matches ab, \d+ fails on all of: 12de34fg, 12de34f, 12de34, 12de3, 12de, 12d
then \d+ succeeds on 12
ab12 is matched
move to match de34 in the same way

---

ab12de34fg with .*\d+ is:

.* matches ab12de34fg, \d+ fails on empty string
.* matches ab12de34f, \d+ fails on g
.* matches ab12de34, \d+ fails on all of: fg, f
.* matches ab12de3, \d+ fails on all of: 4fg, 4f
then \d+ succeeds on 4
ab12de34 is matched

Cloudera's doc reads like it finds every match and then returns the one with the lowest char count (but that's not true for your example, or their next example), so I'm tempted to say that either their .*? is broken, or I don't understand how to conceive what their docs say they search on. I wish I had access to an Impala instance to play about with it some more, wrap the .*? in brackets and see what it matches, etc.

From the lattermost example in the docs it looks like you can get it to behave like other implementations by putting .*? on the end of the pattern... but I'd be keen to see Cloudera offer a more involved explanation as to why their regex matching here is unconventional.
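
For anyone who wants to check the two traces above without an Impala or Hive instance, here is a minimal sketch using Python's re module as a stand-in for a conventional backtracking engine. It only shows the behaviour I would expect from such an engine; it makes no claim about what Hive or Impala actually return. The capture groups are my own addition, wrapped around .*? and \d+ to show what each part ends up holding.

```python
import re

s = "ab12de34fg"

# Lazy: .*? starts empty and grows one character at a time from the left
# until \d+ can match, so the leftmost digits are found first.
lazy = re.search(r"(.*?)(\d+)", s)
print(lazy.group(0), lazy.groups())      # ab12 ('ab', '12')

# Greedy: .* grabs the whole string, then gives characters back from the
# right until \d+ can match, so only the last digit is left for \d+.
greedy = re.search(r"(.*)(\d+)", s)
print(greedy.group(0), greedy.groups())  # ab12de34 ('ab12de3', '4')

# Repeating the lazy match continues after ab12 and then finds de34.
all_lazy = re.findall(r".*?\d+", s)
print(all_lazy)                          # ['ab12', 'de34']
```

If Impala returns something else for the same pattern and input, that would be consistent with the bug report linked above rather than with the usual lazy-quantifier semantics.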

Answer