Skip to content
Advertisement

REGEX Extract Amount Without Currency

SELECT
    ocr_text,
    bucket,
    REGEXP_EXTRACT('-?[0-9]+(.[0-9]+)?', ocr_text)
FROM temp

I am trying to extract amounts from a string that will not have currency present. Any number that does not have decimals should not match. Commas should be allowed assuming they follow the correct rules (at hundreds marker)

56         no       (missing decimals)
56.45      yes  
120        no       (missing decimals)
120.00     yes
1200.00    yes
1,200.00   yes
1,200      no       (missing decimals)
1200       no       (missing decimals)
134.5      no       (decimal not followed by 2 digits)
23,00.00   no       (invalid comma location)

I’m a noob to REGEX so I know my above statement already does not meet the criteria i’ve listed. However, i’m already stuck getting the error (INVALID_FUNCTION_ARGUMENT) premature end of char-class on my REGEX_EXTRACT line


Can someone point me in the right direction? How can I resolve my current issue? How can I modify to correctly incorporate the other criteria listed?

Advertisement

Answer

Here is a general regex pattern for a positive/negative number with two decimal places and optional thousands comma separators:

(?<!S)(?:-?[0-9]{1,3}(,[0-9]{3})*(.[0-9]{2})|-?[0-9]+(.[0-9]{2}))(?!S)

Demo

Your updated query:

SELECT
    ocr_text,
    bucket,
    REGEXP_EXTRACT(ocr_text, '(?<!S)(?:-?[0-9]{1,3}(,[0-9]{3})*(.[0-9]{2})|-?[0-9]+(.[0-9]{2}))(?!S)')
FROM temp;

From the Presto docs I read, it supposedly supports Java’s regex syntax. In the event that lookarounds are not working, you may try this version:

SELECT
    ocr_text,
    bucket,
    REGEXP_EXTRACT(ocr_text, '(s|^)(?:-?[0-9]{1,3}(,[0-9]{3})*(.[0-9]{2})|-?[0-9]+(.[0-9]{2}))(s|$)')
FROM temp;
User contributions licensed under: CC BY-SA
8 People found this is helpful
Advertisement