SELECT ocr_text, bucket, REGEXP_EXTRACT('-?[0-9]+(.[0-9]+)?', ocr_text) FROM temp
I am trying to extract amounts from a string that will not have currency present. Any number that does not have decimals should not match. Commas should be allowed assuming they follow the correct rules (at hundreds marker)
56 no (missing decimals) 56.45 yes 120 no (missing decimals) 120.00 yes 1200.00 yes 1,200.00 yes 1,200 no (missing decimals) 1200 no (missing decimals) 134.5 no (decimal not followed by 2 digits) 23,00.00 no (invalid comma location)
I’m a noob to REGEX so I know my above statement already does not meet the criteria i’ve listed. However, i’m already stuck getting the error (INVALID_FUNCTION_ARGUMENT) premature end of char-class
on my REGEX_EXTRACT
line
Can someone point me in the right direction? How can I resolve my current issue? How can I modify to correctly incorporate the other criteria listed?
Advertisement
Answer
Here is a general regex pattern for a positive/negative number with two decimal places and optional thousands comma separators:
(?<!S)(?:-?[0-9]{1,3}(,[0-9]{3})*(.[0-9]{2})|-?[0-9]+(.[0-9]{2}))(?!S)
Your updated query:
SELECT ocr_text, bucket, REGEXP_EXTRACT(ocr_text, '(?<!S)(?:-?[0-9]{1,3}(,[0-9]{3})*(.[0-9]{2})|-?[0-9]+(.[0-9]{2}))(?!S)') FROM temp;
From the Presto docs I read, it supposedly supports Java’s regex syntax. In the event that lookarounds are not working, you may try this version:
SELECT ocr_text, bucket, REGEXP_EXTRACT(ocr_text, '(s|^)(?:-?[0-9]{1,3}(,[0-9]{3})*(.[0-9]{2})|-?[0-9]+(.[0-9]{2}))(s|$)') FROM temp;