Term Vector

What it is

A term vector is a stored per-document summary of what the analysis chain produced for a field. For each document, it records:

  • Which terms appeared (the term text after analysis)
  • The frequency of each term in that document
  • Optionally: the position of each occurrence and the character offsets in the original string

Term vectors are the closest thing to a stored forward index in Lucene. They make it possible to inspect the analysed content of a specific document — answering “what terms does document D42 contain?” without scanning the entire inverted index.
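The structure described above can be sketched in a few lines of Python. This is only an illustration: the regex-plus-lowercase "analyzer" and the function name build_term_vector are assumptions made for this example, not Lucene's analysis chain or storage format.

```python
import re
from collections import defaultdict


def build_term_vector(text):
    """Return {term: {"freq", "positions", "offsets"}} for one field value.

    Hypothetical stand-in analyzer: split on runs of word characters,
    then lowercase. A real analysis chain may also stem, drop stop words, etc.
    """
    vector = defaultdict(lambda: {"freq": 0, "positions": [], "offsets": []})
    for position, match in enumerate(re.finditer(r"\w+", text)):
        entry = vector[match.group().lower()]
        entry["freq"] += 1
        entry["positions"].append(position)
        # Character offsets point back into the original, un-analysed string.
        entry["offsets"].append((match.start(), match.end()))
    # Sort by term so the result reads like a term dictionary.
    return dict(sorted(vector.items()))


vector = build_term_vector("The quick brown fox")
for term, entry in vector.items():
    print(term, entry["freq"], entry["positions"], entry["offsets"])
```

Looking up a document's terms is then a single dictionary access per document, which is the "forward index" behaviour described above.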

How it works

Term vectors are enabled per field. In an Elasticsearch mapping:

"body": {
  "type": "text",
  "term_vector": "with_positions_offsets"
}

Options:

Setting                  Stores
no                       Nothing (default)
yes                      Terms and frequencies only
with_positions           Terms, frequencies, and token positions
with_offsets             Terms, frequencies, and character offsets
with_positions_offsets   All of the above
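As a rough sketch, the request body for creating an index with such a field could be assembled as follows. The helper name text_field_mapping is hypothetical, the option set is limited to the values listed above, and the "mappings"/"properties" wrapper assumes a recent Elasticsearch version with typeless mappings.

```python
import json

# Options taken from the table above; a given Elasticsearch version
# may accept additional values.
TERM_VECTOR_OPTIONS = {
    "no",
    "yes",
    "with_positions",
    "with_offsets",
    "with_positions_offsets",
}


def text_field_mapping(field, term_vector="no"):
    """Build a create-index body for one text field (hypothetical helper)."""
    if term_vector not in TERM_VECTOR_OPTIONS:
        raise ValueError(f"unknown term_vector option: {term_vector!r}")
    return {
        "mappings": {
            "properties": {
                field: {"type": "text", "term_vector": term_vector}
            }
        }
    }


body = text_field_mapping("body", "with_positions_offsets")
print(json.dumps(body, indent=2))
```

The printed JSON is what would be sent as the body of an index-creation request; validating the option locally simply surfaces typos before the request is made.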

At index time, the term vector for each document is written to dedicated term-vector files in the Lucene segment (such as .tvx and .tvd; the exact set depends on the Lucene version). Retrieval is by document ID.

In Elasticsearch, term vectors for a document can be retrieved via the _termvectors API:

GET /my-index/_termvectors/42?fields=body

This returns each term, its frequency, and positions/offsets if stored.
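The response nests the per-term data under term_vectors, then the field name, then terms. The sketch below flattens such a response into rows; the sample_response dictionary is hand-written for illustration (not captured from a live cluster) and the function name term_vector_rows is an assumption of this example.

```python
def term_vector_rows(response, field):
    """Flatten a _termvectors-style response into one row per term."""
    terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
    rows = []
    for term, info in sorted(terms.items()):
        # "tokens" is only present when positions and/or offsets are available.
        tokens = info.get("tokens", [])
        rows.append({
            "term": term,
            "freq": info["term_freq"],
            "positions": [t["position"] for t in tokens if "position" in t],
            "offsets": [
                (t["start_offset"], t["end_offset"])
                for t in tokens
                if "start_offset" in t
            ],
        })
    return rows


# Illustrative, hand-written response for the text "The quick brown fox".
sample_response = {
    "_index": "my-index",
    "_id": "42",
    "found": True,
    "term_vectors": {
        "body": {
            "terms": {
                "the": {"term_freq": 1, "tokens": [{"position": 0, "start_offset": 0, "end_offset": 3}]},
                "quick": {"term_freq": 1, "tokens": [{"position": 1, "start_offset": 4, "end_offset": 9}]},
                "brown": {"term_freq": 1, "tokens": [{"position": 2, "start_offset": 10, "end_offset": 15}]},
                "fox": {"term_freq": 1, "tokens": [{"position": 3, "start_offset": 16, "end_offset": 19}]},
            }
        }
    },
}

rows = term_vector_rows(sample_response, "body")
for row in rows:
    print(row)
```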

[illustrate: a document “The quick brown fox” with its term vector shown — a table with columns: term, frequency, positions, start_offset, end_offset — rows: “brown”(1, [2], 10, 15), “fox”(1, [3], 16, 19), “quick”(1, [1], 4, 9), “the”(1, [0], 0, 3)]

Example

Document D1: "the cat sat on the cat"

Term vector (with positions):

Term   Freq   Positions
cat    2      [1, 5]
on     1      [3]
sat    1      [2]
the    2      [0, 4]

Term vectors like this support:

  • Highlighting: when character offsets are also stored, they pinpoint where to insert <em> tags in the original string.
  • More Like This: the most distinctive terms (frequent in the document, rare across the index) are extracted from the term vector to form a similarity query.
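To make the highlighting case concrete, here is a minimal sketch of how stored character offsets can drive tag insertion for document D1. The highlight function and the hard-coded offsets for "cat" are assumptions of this example; it also assumes the offset ranges do not overlap. It is not how any particular highlighter is implemented.

```python
def highlight(text, offsets, tag="em"):
    """Wrap each (start, end) character range of text in <tag>...</tag>.

    Assumes the ranges do not overlap.
    """
    pieces = []
    cursor = 0
    for start, end in sorted(offsets):
        pieces.append(text[cursor:start])                    # text before the hit
        pieces.append(f"<{tag}>{text[start:end]}</{tag}>")  # the hit itself
        cursor = end
    pieces.append(text[cursor:])                             # trailing text
    return "".join(pieces)


text = "the cat sat on the cat"
# Character offsets of the term "cat" in D1, as a term vector
# stored with offsets would record them.
cat_offsets = [(4, 7), (19, 22)]

highlighted = highlight(text, cat_offsets)
print(highlighted)  # the <em>cat</em> sat on the <em>cat</em>
```

Because the offsets refer to the original string, no re-analysis of the text is needed at highlight time.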

Variants and history

Term vectors have been part of Lucene since early versions. They were the primary mechanism for fast highlighting before the introduction of the Unified Highlighter, which does not require stored term vectors: it can read offsets from the postings when the field is indexed with index_options set to offsets, or fall back to re-analysing the text.

Real-time term vectors: Elasticsearch supports fetching term vectors at request time, even for fields not configured with stored term vectors. In that case the engine re-analyses the field value (taken from the stored document source) on the fly. This is useful for debugging analysis without reindexing.

When to use it

Enable term vectors (with_positions_offsets) when:

  • Using the Fast Vector Highlighter: Elasticsearch’s FVH requires term vectors stored with positions and offsets; it is faster than the plain highlighter for large fields.
  • Implementing More Like This: the MLT query reads stored term vectors to select the most representative terms; without them it has to re-analyse the field text at query time.
  • Debugging analysis: use _termvectors to verify what the index-time analyzer actually produced for a specific document.

Avoid enabling term vectors on fields where they are not needed, because they add non-trivial storage overhead: the Elasticsearch documentation warns that with_positions_offsets doubles the size of a field's index.

See also