HTML Strip
What it is
HTML stripping is the process of removing HTML markup from a text string so that the downstream analysis pipeline sees only human-readable content. It operates at the character level — on the raw character stream, before any tokenisation occurs — making it a character filter, not a token filter.
The distinction matters architecturally. A token filter receives an already-split stream of discrete tokens and discards or transforms individual ones. A character filter operates on the original string in a single pass, rewriting characters before the tokeniser ever runs. In Lucene-based engines (Elasticsearch, OpenSearch, Solr), character filters are declared ahead of the tokeniser in an analyser definition and are executed in that order. HTML stripping must run first, because the tokeniser will otherwise split on angle brackets and produce tokens like "div", "href", and "https" from markup that was never meant to be indexed.
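In Elasticsearch, for example, this ordering is declared directly in the analyser definition. A minimal index-settings fragment (the index name and analyser name here are illustrative) using the built-in html_strip character filter:

```json
PUT /articles
{
  "settings": {
    "analysis": {
      "analyzer": {
        "html_text": {
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```

The char_filter list runs, in order, before the tokeniser, so the standard tokeniser never sees angle brackets or entity references.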
How it works
HTML stripping performs two distinct operations that are often conflated:
Tag removal
A tag-removal pass scans the character stream for < and > delimiters and discards everything between them, inclusive. <strong>important</strong> becomes important. This sounds simple but involves several design decisions:
Block-level tags and whitespace. Inline tags like <strong> and <em> wrap text that flows continuously with surrounding content. Block-level tags like <p>, <div>, <h1>, <li>, and <br> mark structural boundaries — the text on each side should be treated as separate units. A naive stripper collapses <p>First paragraph.</p><p>Second paragraph.</p> into First paragraph.Second paragraph. — a single run-on string. A correct implementation replaces block-level tags with a whitespace character (typically a newline or space) rather than nothing, preserving the word boundary. The Lucene HTMLStripCharFilter takes this approach: block-level and certain other tags are substituted with \n.
Script and style blocks. The content of <script> and <style> elements is not human-readable text — it is JavaScript or CSS source. A stripper that removes only tags but leaves their text content will index JavaScript identifiers and CSS property names alongside the document’s actual prose. A correct implementation removes the entire element — opening tag, content, and closing tag — for these two element types. The same applies to <head> content in a full HTML document.
HTML comments. The sequence <!-- ... --> is not a tag but is also not indexable content. Comment stripping is separate from tag stripping and must be handled explicitly. Comments can span multiple lines and can contain > characters that would confuse a tag parser looking only for <...> patterns.
CDATA sections. XML-embedded HTML may include <![CDATA[ ... ]]> sections, which can contain raw markup characters without escaping. A stripper that treats < as an unconditional tag-open delimiter will mis-parse CDATA content.
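The decisions above can be sketched with Python's standard-library HTMLParser, which already tracks tag boundaries, comments, and script/style content. This is a minimal illustration, not a production stripper; the tag sets and function name are illustrative:

```python
from html.parser import HTMLParser

# Block-level tags become newlines; script/style content is dropped entirely.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "tr", "td", "br"}
SKIP_TAGS = {"script", "style"}

class TagStripper(HTMLParser):
    def __init__(self):
        # convert_charrefs=False leaves entity references untouched,
        # so they can be decoded in a later pass (see "Entity decoding").
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")  # preserve the word boundary

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

    # Re-emit entity references verbatim for the later decode pass.
    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.parts.append(f"&#{name};")

def strip_tags(html_text: str) -> str:
    p = TagStripper()
    p.feed(html_text)
    p.close()
    return "".join(p.parts)
```

Comments are dropped for free: HTMLParser routes them to a handle_comment callback whose default implementation does nothing.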
Entity decoding
HTML entity decoding converts entity references to their Unicode character equivalents. There are three forms:
- Named entities: &amp; → &, &lt; → <, &gt; → >, &nbsp; → U+00A0 (non-breaking space), &eacute; → é. HTML5 defines over 2 000 named entities.
- Decimal numeric references: &#233; → é (U+00E9).
- Hexadecimal numeric references: &#xE9; → é.
Entity decoding must happen after tag removal. If entities are decoded first, a sequence like &lt;script&gt; becomes <script> — a tag that the subsequent pass would then strip, potentially corrupting content that was intentionally escaped. The safe order is: strip tags and comments, then decode entities in the remaining character stream.
After decoding, &nbsp; (non-breaking space, U+00A0) deserves special handling. It is a common separator in HTML content but is not a regular space character and will not be recognised as whitespace by most tokenisers. Normalising it to a regular space (U+0020) — or running Unicode normalisation afterward — prevents "word&nbsp;word" from tokenising as a single token.
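In Python, for instance, the standard library's html.unescape handles all three reference forms in one call. A post-stripping decode step (the function name is illustrative) might look like:

```python
import html

def decode_entities(text: str) -> str:
    # html.unescape covers named (&eacute;), decimal (&#233;),
    # and hexadecimal (&#xE9;) references in a single pass.
    decoded = html.unescape(text)
    # Normalise the non-breaking space to a regular space so that
    # tokenisers treat it as a word boundary.
    return decoded.replace("\u00a0", " ")
```

For example, decode_entities("caf&eacute;&nbsp;au lait") yields "café au lait", with the non-breaking space already normalised.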
Example
Raw input (an HTML product description snippet):
<div class="desc">
<h2>Wireless Keyboard &amp; Mouse</h2>
<p>Compatible with <strong>Windows</strong> &amp; macOS.</p>
<!-- Promotional copy -->
<script>trackView('kb-001');</script>
</div>
After HTML stripping:
Wireless Keyboard & Mouse
Compatible with Windows & macOS.
The <script> block and its content are gone entirely. The <h2> and <p> block tags have been replaced with newlines, preserving the word boundary between “Mouse” and “Compatible”. The &amp; entities have decoded to &. The comment has been removed. The tokeniser now receives clean prose.
Variants and history
Lucene HTMLStripCharFilter. The canonical reference implementation for Elasticsearch, OpenSearch, and Solr. It handles tag removal, script and style block stripping, comment stripping, CDATA handling, and all three entity reference forms. Block-level elements (p, div, h1–h6, br, li, tr, td, and others) are replaced with \n. It uses a generated JFlex lexer rather than a regex, making it robust to malformed input that trips up pattern-based approaches.
Regex-based strippers. Many pipelines outside Lucene strip HTML with a regular expression such as /<[^>]+>/g. This works for well-formed HTML with no edge cases but fails silently on: tags containing > inside attribute values (<img alt="a > b">), multi-line tags, CDATA, and script blocks. Regex stripping is acceptable for quick experiments but should not be used in production indexing pipelines.
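The failure mode is easy to demonstrate. With a > inside an attribute value, the pattern ends its match too early and leaks attribute debris into the output:

```python
import re

# The naive pattern: match from < to the first >.
TAG_RE = re.compile(r"<[^>]+>")

# Well-formed input: works as expected.
clean = TAG_RE.sub("", "<strong>important</strong>")  # "important"

# The > inside the alt attribute terminates the match early,
# leaving ' b">photo' instead of 'photo'.
broken = TAG_RE.sub("", '<img alt="a > b">photo')
```

The same pattern also leaves <script> contents, comment bodies with embedded >, and CDATA sections untouched — none of which it was written to recognise.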
Parser-based stripping. Libraries like Python’s html.parser, BeautifulSoup, or lxml build a full DOM and extract text nodes. This is the most correct approach and handles malformed markup gracefully, but it is also the most expensive. Parser-based stripping is well-suited to ETL pipelines that preprocess documents before they reach the search engine; it is too slow to run inline inside a per-document analysis chain at index time.
<br> and self-closing tags. XHTML requires explicit self-closing syntax (<br />); HTML5 does not. A stripper must handle both forms without treating the trailing / as a distinct token.
When to use it
Apply HTML stripping as a character filter whenever the content being indexed was authored or stored as HTML — web crawl data, CMS export fields, email bodies with HTML formatting, rich-text editor output. Failing to strip means markup tokens pollute the index and inflate postings lists with noise terms.
Character filter vs preprocessing. For high-throughput pipelines, consider stripping HTML at the document-ingestion stage (ETL) rather than inside the analyser. Storing pre-stripped text means the index cluster never parses markup, and you can apply a full parser-based stripper rather than an inline lexer. The tradeoff is that the stored source loses the original HTML, which matters if you need to re-analyse documents with a changed pipeline without re-crawling the source.
Stripping vs structured extraction. Raw stripping discards potentially valuable structure. An <h1> tag signals high-importance content; an <a href> contains a URL; <title> and <meta name="description"> are document metadata. For precision-oriented search, consider extracting these elements into separate indexed fields (title, body, url, description) during ingestion rather than collapsing everything into a single stripped text field. Boosting the title field at query time is far more effective than treating <title> as equivalent noise to a <div>.
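A minimal sketch of this kind of routing during ingestion, again using Python's standard-library parser — the field names and the set of routed tags are illustrative, and a real pipeline would layer this on top of full stripping:

```python
from html.parser import HTMLParser

class FieldExtractor(HTMLParser):
    """Route text into separate fields by enclosing tag,
    instead of flattening everything into one stripped string."""

    def __init__(self):
        super().__init__()
        # Hypothetical field names; real ones depend on your schema.
        self.fields = {"title": [], "headings": [], "body": [], "urls": []}
        self.target = "body"

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.target = "title"
        elif tag in ("h1", "h2"):
            self.target = "headings"
        elif tag == "a":
            # Capture the link target as structure, not just anchor text.
            href = dict(attrs).get("href")
            if href:
                self.fields["urls"].append(href)

    def handle_endtag(self, tag):
        if tag in ("title", "h1", "h2"):
            self.target = "body"

    def handle_data(self, data):
        if data.strip():
            self.fields[self.target].append(data.strip())

def extract_fields(html_doc: str) -> dict:
    p = FieldExtractor()
    p.feed(html_doc)
    p.close()
    return p.fields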
Malformed markup. Real-world HTML from web crawls is frequently malformed — unclosed tags, unescaped < in attribute values, overlapping elements. A lexer-based stripper like HTMLStripCharFilter degrades more gracefully than a regex under these conditions. If your corpus is particularly dirty, a full parser-based ETL step is preferable.
Queries containing HTML. If users can submit queries that contain HTML (e.g. a search box that accepts pasted content), apply the same HTML-strip character filter to the query analyser as to the index analyser. Failing to do so means a query containing &amp; will not match indexed content where that entity was decoded to &.
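In Elasticsearch, for example, assigning a single analyser in the field mapping applies it at both index and query time by default — assuming an analyser named html_text that includes the html_strip character filter has been defined in the index settings (names illustrative):

```json
PUT /articles
{
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "analyzer": "html_text"
      }
    }
  }
}
```

The mismatch only arises if a separate search_analyzer is configured without the HTML-strip character filter in its chain.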