Skip to content
Advertisement

How can I extract all text between single full stops (i.e. ignore `…`) with a re2 regex?

How can I extract all text between single full stops (i.e. ignore ...) with a re2 regex?

I am using the REGEXP_EXTRACT_ALL function in BigQuery which uses https://github.com/google/re2/wiki/Syntax.

From the following example:

This is... a.. sentence. It is just an example.

I would like the query to extract

This is... a.. sentence. and It is just an example.

I am specifically interested in whether this is possible using SQL functions in BigQuery rather than introducing another tool

Advertisement

Answer

Consider below workaround

select text, regexp_replace(sentence, r'(#)(.+)(#)', r'2') sentence
from `project.dataset.table`, 
unnest(split(trim(regexp_replace(regexp_replace(text, r'(.+)', r'#1#'), r'(#.#)', r'####'), '####'), '####')) sentence 

if applied to sample data in your question – output is

enter image description here

User contributions licensed under: CC BY-SA
10 People found this is helpful
Advertisement