# Path Hierarchy Tokeniser

## What it is
A path hierarchy tokeniser takes a delimited path string — a file path, URL segment, or category breadcrumb — and emits every prefix of the full path as a separate token. A document indexed at /usr/local/bin also becomes retrievable via /usr/local and /usr. The result is subtree search: a single query term matches the node itself and every descendant indexed beneath it.
The tokeniser is not general-purpose text splitting. It is purpose-built for hierarchical string structures where containment — “this document lives under this node” — is a first-class query operation.
## How it works
The algorithm advances a cursor through the input string. Each time it encounters the delimiter character it emits the substring from the start up to that position. After the final delimiter (or the end of the string) it emits the full path. The result is a sequence of tokens of increasing length — one per level of the hierarchy.
```python
def path_hierarchy_tokens(path: str, delimiter: str = "/") -> list[str]:
    tokens = []
    parts = path.strip(delimiter).split(delimiter)
    for i in range(1, len(parts) + 1):
        tokens.append(delimiter + delimiter.join(parts[:i]))
    return tokens

path_hierarchy_tokens("/usr/local/bin")
# → ['/usr', '/usr/local', '/usr/local/bin']
```
A leading delimiter is preserved as a path anchor; the root itself (/) is typically omitted because it would match every document in the corpus — analogous to a stop word.
## Example
Input path: `/electronics/audio/headphones`

| Token emitted | Represents |
|---|---|
| `/electronics` | top-level category |
| `/electronics/audio` | mid-level subcategory |
| `/electronics/audio/headphones` | leaf node |
A user querying /electronics/audio now matches all three levels of product indexed under that subtree — headphones, speakers, DACs — without requiring a prefix query or wildcard at query time. The work is done at index time.
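The index-time trade can be seen with a toy inverted index. This is a minimal sketch in plain Python; the dict-of-sets index stands in for a real engine's postings lists, and the document ids are invented for illustration:

```python
from collections import defaultdict

def path_hierarchy_tokens(path: str, delimiter: str = "/") -> list[str]:
    parts = path.strip(delimiter).split(delimiter)
    return [delimiter + delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]

# Toy inverted index: token -> set of document ids.
index = defaultdict(set)
docs = {
    1: "/electronics/audio/headphones",
    2: "/electronics/audio/speakers",
    3: "/electronics/video/projectors",
}
for doc_id, path in docs.items():
    for token in path_hierarchy_tokens(path):
        index[token].add(doc_id)

# A single exact-term lookup now returns the whole subtree.
print(sorted(index["/electronics/audio"]))  # [1, 2]
print(sorted(index["/electronics"]))        # [1, 2, 3]
```

Because every ancestor token is stored in the postings, the subtree query is an ordinary term lookup rather than a prefix scan.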
Reverse mode. Some implementations emit suffixes rather than prefixes, starting from the leaf and working upward:
```
/electronics/audio/headphones
/audio/headphones
/headphones
```
This makes the leaf the primary match and broader categories secondary — useful when category pages are less important than specific product pages.
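A suffix-emitting variant is a small change to the split-based sketch above; the function name here is illustrative:

```python
def reverse_path_tokens(path: str, delimiter: str = "/") -> list[str]:
    """Emit suffixes of the path, from the full path down to the leaf."""
    parts = path.strip(delimiter).split(delimiter)
    return [delimiter + delimiter.join(parts[i:]) for i in range(len(parts))]

reverse_path_tokens("/electronics/audio/headphones")
# → ['/electronics/audio/headphones', '/audio/headphones', '/headphones']
```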
## Variants and history
The Lucene / Elasticsearch / OpenSearch `path_hierarchy` tokeniser is the canonical implementation. Its two core parameters are:

- `delimiter`: the split character, defaulting to `/`
- `reverse`: when `true`, emits suffixes instead of prefixes
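As a concrete illustration, an Elasticsearch index-settings fragment wiring the built-in tokeniser into a custom analyser might look like the following; the analyser and tokeniser names are placeholders:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "path_analyzer": {
          "type": "custom",
          "tokenizer": "path_tokenizer"
        }
      },
      "tokenizer": {
        "path_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "/"
        }
      }
    }
  }
}
```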
The tokeniser was designed primarily for Apache Solr category faceting and later adopted in Elasticsearch. Both expose it as a built-in analysis component requiring no plugins.
URL hierarchies. The same approach applies to URL paths (/blog/2024/march/post-title), though URL tokenisation typically requires stripping query strings and fragments before feeding the path segment to the tokeniser. Some pipelines use a Regex Tokeniser as a preprocessing step to isolate the path component first.
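The preprocessing step can also be done without a regex, using the standard library's URL parser. A sketch, where `url_path_tokens` is an illustrative name:

```python
from urllib.parse import urlparse

def url_path_tokens(url: str) -> list[str]:
    """Drop scheme, host, query string, and fragment, then tokenise the path."""
    path = urlparse(url).path
    parts = path.strip("/").split("/")
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]

url_path_tokens("https://example.com/blog/2024/march/post-title?utm=x#top")
# → ['/blog', '/blog/2024', '/blog/2024/march', '/blog/2024/march/post-title']
```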
Custom delimiters. Category strings in e-commerce systems often use > (Electronics > Audio > Headphones) or | as separators. The tokeniser works identically — only the delimiter character changes.
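A minimal sketch with a multi-character delimiter, assuming the separator includes the surrounding spaces and that no path anchor is prepended:

```python
def hierarchy_tokens(path: str, delimiter: str) -> list[str]:
    """Emit every prefix of the path, split on an arbitrary delimiter string."""
    parts = path.split(delimiter)
    return [delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]

hierarchy_tokens("Electronics > Audio > Headphones", " > ")
# → ['Electronics', 'Electronics > Audio', 'Electronics > Audio > Headphones']
```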
## When to use it
Use it when:
- Documents have a hierarchical category, taxonomy, or file-system location and you need to retrieve all descendants of a given node without wildcard queries.
- Faceted navigation must aggregate across parent categories: a facet count for `/electronics` should include all products in any subcategory beneath it.
- You are indexing log file paths, URL structures, or org-chart hierarchies where containment is a meaningful retrieval dimension.
Avoid it when:
- The path depth is unbounded or highly irregular. Very deep paths generate many tokens per document; at extreme depths this bloats the index and may trigger field length limits.
- You need full substring matching within a path segment rather than prefix matching across segments. A Regex Tokeniser or Edge N-Gram tokeniser is more appropriate for that.
- The input is not genuinely hierarchical. Applying this tokeniser to arbitrary slash-separated strings that do not represent containment hierarchies produces tokens with no useful query semantics.