SQL – Remove all HTML tags in a string

In my dataset, I have a field which stores text marked up with HTML. The general format is as follows:

<html><head></head><body><p>My text.</p></body></html>

I could attempt to solve the problem by doing the following:

REPLACE(REPLACE(Table.HtmlData, '<html><head></head><body><p>', ''), '</p></body></html>', '')

However, this is not a strict rule: some entries break W3C standards and omit the <head> tags, for example. Even worse, closing tags may be missing. So I would need a REPLACE call for each opening and closing tag that could exist.

REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
    Table.HtmlData,
    '<html>', ''),
    '</html>', ''),
    '<head>', ''),
    '</head>', ''),
    '<body>', ''),
    '</body>', ''),
    '<p>', ''),
    '</p>', '')

 

I was wondering if there was a better way to accomplish this than using multiple nested REPLACE functions. Unfortunately, the only languages I have available in this environment are SQL and Visual Basic (not .NET).

Answer

DECLARE @x XML = '<html><head></head><body><p>My text.</p></body></html>'

SELECT t.c.value('.', 'NVARCHAR(MAX)')
FROM @x.nodes('*') t(c)
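
To run the same idea against a column rather than a variable, something like the following might work. This is only a sketch: the table variable @Docs and its Id column are made up for illustration (the question only names Table.HtmlData), and it assumes every row is already well-formed XML, otherwise the CAST raises an error.

```sql
-- Hypothetical sample data standing in for the asker's table
DECLARE @Docs TABLE (Id INT, HtmlData NVARCHAR(MAX));

INSERT INTO @Docs (Id, HtmlData) VALUES
    (1, N'<html><head></head><body><p>My text.</p></body></html>'),
    (2, N'<html><body><p>No head tag here.</p></body></html>');

-- Cast each row to XML, then read the concatenated text nodes with value('.')
SELECT d.Id,
       CAST(d.HtmlData AS XML).value('.', 'NVARCHAR(MAX)') AS PlainText
FROM @Docs d;
```

Because value('.') returns the text content of the whole document, it does not matter which tags are present, so the missing <head> in the second row needs no special handling.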

 

Update – For strings with unclosed tags:

DECLARE @x NVARCHAR(MAX) = '<html><head></head><body><p>My text.<br>More text.</p></body></html>'

SELECT x.value('.', 'NVARCHAR(MAX)')
FROM (
    SELECT x = CAST(REPLACE(REPLACE(@x, '>', '/>'), '</', '<') AS XML)
) r
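
Why the double REPLACE helps: the inner call turns <p> into <p/> and </p> into </p/>, and the outer call then turns </p/> into <p/>. Every tag ends up as an empty, self-closing element with the text sitting between them, so unbalanced tags no longer matter. The sketch below applies this per row. The table variable @Docs and the column aliases are hypothetical, and TRY_CAST assumes SQL Server 2012 or later; it is used here so that rows which still fail to parse come back as NULL instead of aborting the query.

```sql
-- Hypothetical sample data; not from the original question
DECLARE @Docs TABLE (Id INT, HtmlData NVARCHAR(MAX));

INSERT INTO @Docs (Id, HtmlData) VALUES
    (1, N'<html><body><p>My text.<br>More text.</p></body></html>'),
    (2, N'<p>Closing tag is missing'),
    (3, N'<p>a &nbsp; b</p>');  -- &nbsp; is not a predefined XML entity, so this row should not parse

SELECT d.Id,
       n.Normalized,                                   -- every tag rewritten as <tag/>
       x.Doc.value('.', 'NVARCHAR(MAX)') AS PlainText  -- NULL when the cast failed
FROM @Docs d
CROSS APPLY (
    SELECT Normalized = REPLACE(REPLACE(d.HtmlData, '>', '/>'), '</', '<')
) n
CROSS APPLY (
    SELECT Doc = TRY_CAST(n.Normalized AS XML)
) x;
```

Note the limits of the trick: a tag that is already self-closed, such as <br/>, becomes <br//> and breaks the cast, and so does any literal > in the text or any HTML-only entity. Rows like that would need cleaning before the cast.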

 
