In my dataset, I have a field which stores text marked up with HTML. The general format is as follows:
<html><head></head><body><p>My text.</p></body></html>
I could attempt to solve the problem by doing the following:
REPLACE(REPLACE(Table.HtmlData, '<html><head></head><body><p>', ''), '</p></body></html>', '')
However, this is not a strict rule: some entries break W3C standards and omit the <head>
tags, for example. Even worse, closing tags may be missing. So I would need a REPLACE
call for each opening and closing tag that could exist.
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
    Table.HtmlData,
    '<html>', ''), '</html>', ''),
    '<head>', ''), '</head>', ''),
    '<body>', ''), '</body>', ''),
    '<p>', ''), '</p>', '')
I was wondering if there was a better way to accomplish this than using multiple nested REPLACE
functions. Unfortunately, the only languages I have available in this environment are SQL and Visual Basic (not .NET).
Answer
DECLARE @x XML = '<html><head></head><body><p>My text.</p></body></html>';
SELECT t.c.value('.', 'NVARCHAR(MAX)')
FROM @x.nodes('*') t(c);
Update – For strings with unclosed tags:
DECLARE @x NVARCHAR(MAX) = '<html><head></head><body><p>My text.<br>More text.</p></body></html>';
SELECT x.value('.', 'NVARCHAR(MAX)')
FROM (
    SELECT x = CAST(REPLACE(REPLACE(@x, '>', '/>'), '</', '<') AS XML)
) r;
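The update works because the two REPLACE calls turn every tag, opening or closing, into an empty self-closing element ('<p>' and '</p>' both become '<p/>'), so nothing is left to be mismatched and the text nodes are all that remain. Here is a minimal sketch of applying the same idea to a column rather than a variable. The table variable @Docs, its columns, and the sample rows are hypothetical stand-ins for the real table, and TRY_CAST assumes SQL Server 2012 or later:

```sql
-- Hypothetical sample data; @Docs and its columns stand in for the real table.
DECLARE @Docs TABLE (Id INT, HtmlData NVARCHAR(MAX));
INSERT INTO @Docs (Id, HtmlData) VALUES
    (1, N'<html><head></head><body><p>My text.</p></body></html>'),
    (2, N'<html><body><p>My text.<br>More text.</body></html>'),
    (3, N'<p>Stray bracket < here</p>');

SELECT d.Id,
       -- NULL here means the rewritten string was still not parseable as XML.
       PlainText = x.doc.value('.', 'NVARCHAR(MAX)')
FROM @Docs AS d
CROSS APPLY (
    -- Turn every tag into an empty, self-closing element:
    -- '<p>' becomes '<p/>' and '</p>' also becomes '<p/>'.
    -- TRY_CAST returns NULL instead of raising an error on invalid input.
    SELECT doc = TRY_CAST(REPLACE(REPLACE(d.HtmlData, '>', '/>'), '</', '<') AS XML)
) AS x;
```

With TRY_CAST, a row that still cannot be parsed (such as row 3, with its stray '<') should come back as NULL rather than aborting the whole query. Other inputs likely to end up NULL are tags that are already self-closing ('<br/>' becomes '<br//>') and HTML-only entities such as &nbsp;, so those rows may need separate handling.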