Skip to content
Advertisement

MySql regexp on column with html

I have a table with HTML content in which I need to select all rows containing a span element with the CSS class “glyphicon glyphicon-film” with some text content, but exclude all rows without content or just containing a white space.

These rows should NOT match:

<span class="glyphicon glyphicon-film">&nbsp;</span>
<span class="glyphicon glyphicon-film"></span>
<span class="glyphicon glyphicon-film"> </span>
<span class="hypo-link-target" data-link-target-id="1379">Hammartå</span>

These rows should match:

<span class="glyphicon glyphicon-film">nån text</span>
<span class="glyphicon glyphicon-film">Nån Text</span>
<span class="glyphicon glyphicon-film">NÅN TEXT</span>
<span class="glyphicon glyphicon-film">&nbsp;nån text</span>
<span class="glyphicon glyphicon-film">&nbsp;Nån Text</span>
<span class="glyphicon glyphicon-film">&nbsp; nån text</span>
<span class="glyphicon glyphicon-film">&nbsp; Nån Text</span>
<span class="glyphicon glyphicon-film"> nån text</span>
<span class="glyphicon glyphicon-film"> Nån Text</span>

The query I have at the moment:

select * from Section 
where (Text regexp '(span class="glyphicon glyphicon-film">&nbsp;.+</span>)' 
or Text regexp 'span class="glyphicon glyphicon-film">[a-zA-Z][|å|ä|ö|Å|Ä|Ö]+</span>');

Which regexp should I use to get the desired result?

UPDATE: Based on suggestions in the answer I ended up with this query which I think works for all cases. This query will also match on rows with two or more span elements where some elements are correct and some are not, like this one:

<span class="glyphicon glyphicon-film">&nbsp;Nån Text</span>more content here... and a correct span <span class="glyphicon glyphicon-film">&nbsp;</span>

select * from Section 
where Text regexp 'span class="glyphicon glyphicon-film">&nbsp;.+</span>' 
        or (Text regexp 'span class="glyphicon glyphicon-film">[a-zA-ZåäöÅÄÖ ]+</span>'
and Text not regexp 'span class="glyphicon glyphicon-film"> </span>');

Advertisement

Answer

The pattern [a-zA-Z][|å|ä|ö|Å|Ä|Ö] matches a string beginning with an ASCII letter, followed by any mixture of |åäöÅÄÖ. You want to match a string of any combination of these characters, as well as spaces, which you left out. That would be [a-zA-ZåäöÅÄÖ ]+.

| is not used inside [] to specify alternatives, since [] is already specifying alternative characters.

select * from Section 
where (Text regexp 'span class="glyphicon glyphicon-film">&nbsp;.+</span>' 
        or Text regexp 'span class="glyphicon glyphicon-film">[a-zA-ZåäöÅÄÖ ]+</span>')
and Text not regexp 'span class="glyphicon glyphicon-film"> +</span>';

The last condition is needed to prevent <span class="glyphicon glyphicon-film"> </span> from matching.

DEMO

User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement