I am trying to create a column in a view in Snowflake that replaces any string between strings that I care about with nothing.
This is essentially for the purpose of stripping html formatting out of text. As an example:
<ul> <li>Text I care about 1 <li>Text I care about 2</li> <li>Text I care about 3</li> </ul>
Would should end up like this:
Text I care about 1 Text I care about 2 Text I care about 3
Based on the patterns I am seeing, I think that if I can eliminate any string starting with <, and ending with >, I should be able to achieve the result I am looking for.
In testing on different sites it seems like expression
REGEXP_REPLACE(originaltext, '<.+?>','')
should, work, but when attempting in Snowflake it seems to be cutting off the last ‘Text I care about’ in some cases, and in other cases just is not showing any results at all. I am not sure if there is a syntax difference or something else off in the version of regex snowflake is using, but any advice would be appreciated.
Advertisement
Answer
Your regular expression works, but it requires lookarounds.
set sample1 = '<ul>'; set sample2 = '<li>Text I care about 1'; set sample3 = '<li>Text I care about 2</li>'; set sample4 = '<li>Text I care about 3</li>'; set sample5 = '</ul>'; select regexp_replace2($SAMPLE1,'<.+?>',''); select regexp_replace2($SAMPLE2,'<.+?>',''); select regexp_replace2($SAMPLE3,'<.+?>',''); select regexp_replace2($SAMPLE4,'<.+?>',''); select regexp_replace2($SAMPLE5,'<.+?>','');
I wrote a UDF library that supports regular expression lookarounds. It attempts to approximate the built-in Snowflake regular expression functions while supporting lookarounds. The names of the UDFs are the same as the built-in regular expression functions with the suffix “2” as shown in the SQL sample.
https://github.com/GregPavlik/SnowflakeUDFs/tree/main/RegularExpressions