Skip to content
Advertisement

Best approach to index a PostgreSQL database with plain html content?

I have a database with a few million posts, each with a column “content” that contains the post content in plain HTML.

<div class="quoteheader"><a href="http://website.com/message?id=52501">
Quote from: X on October 22, 2013, 02:07:08 PM</a>
</div>
<div class="quote">
Hi, how are you all?
<br></div>
<br>I'm good, how about you?

I want to make a full search tool that allows people to search for a post. In this case, someone could search “how are you” and it would result in this post.

I thought about creating a ts_vector index using gin:

CREATE INDEX posts_content_search ON posts using gin(to_tsvector('simple', content));

To allow those kind of searches.

SELECT * FROM posts WHERE to_tsvector('simple', content) @@ phraseto_tsquery('simple', 'how are you');

However, while creating it, not only it keeps showing a lot of those messages:

DETAIL:  Words longer than 2047 characters are ignored.
NOTICE:  word is too long to be indexed

but it also saves the html tags in the index (e.g: div, b, a, br…), while the best would be to remove the tags and index only the post real content (“hi, how are you all” and “i’m good, how about you”)

What would be the best approach to creating an index to allow such kind of search?

Advertisement

Answer

‘simple’ already excludes the contents of html tags, as can be seen from the output of strip(to_tsvector('simple',content)):

                                                         strip                                                         
-----------------------------------------------------------------------------------------------------------------------
 '02' '07' '08' '2013' '22' 'about' 'all' 'are' 'from' 'good' 'hi' 'how' 'i' 'm' 'october' 'on' 'pm' 'quote' 'x' 'you'

Note the absence of ‘br’, ‘div’, etc.

The “Quote from: X” part is included because that was not in a tag. If you want to exclude that, what logic would you want to use to do so?

The warnings about long words can just be ignored. If you want advice on fixing them, you should show us an example which produces them.

User contributions licensed under: CC BY-SA
7 People found this is helpful
Advertisement