Term Vector
What it is
A term vector is a stored per-document summary of what the analysis chain produced for a field. For each document, it records:
- Which terms appeared (the term text after analysis)
- The frequency of each term in that document
- Optionally: the position of each occurrence and the character offsets in the original string
Term vectors are the closest thing to a stored forward index in Lucene. They make it possible to inspect the analysed content of a specific document — answering “what terms does document D42 contain?” without scanning the entire inverted index.
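The difference between the two lookups can be sketched in a few lines of Python. This is a toy model, not Lucene's data structures: the document IDs, the token lists, and the function names are all invented for illustration.

```python
from collections import defaultdict

# Toy corpus: document ID -> analysed terms (hypothetical data).
docs = {
    "D41": ["quick", "fox"],
    "D42": ["lazy", "dog", "lazy"],
}

# Inverted index: term -> set of document IDs that contain it.
inverted = defaultdict(set)
for doc_id, terms in docs.items():
    for term in terms:
        inverted[term].add(doc_id)

# Forward view (roughly what a term vector stores): doc ID -> term -> frequency.
forward = {}
for doc_id, terms in docs.items():
    freqs = {}
    for term in terms:
        freqs[term] = freqs.get(term, 0) + 1
    forward[doc_id] = freqs


def terms_via_inverted(doc_id):
    """Answer 'what terms does doc_id contain?' by checking every term's postings."""
    return sorted(t for t, ids in inverted.items() if doc_id in ids)


def terms_via_forward(doc_id):
    """Answer the same question with a single lookup by document ID."""
    return sorted(forward[doc_id])


print(terms_via_inverted("D42"))  # ['dog', 'lazy']
print(terms_via_forward("D42"))   # ['dog', 'lazy']
```

Both functions return the same answer, but the first touches every term in the index while the second reads one entry.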
How it works
Term vectors are enabled per-field in the schema:
"body": {
"type": "text",
"term_vector": "with_positions_offsets"
}
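As a rough sketch, the fragment above could be wrapped into a full index-creation body like this. It assumes the typeless mapping layout used by recent Elasticsearch versions; the helper name `term_vector_mapping` is hypothetical, and the allowed values are limited to the settings listed in this section.

```python
import json

# Settings covered in this section (an assumption: other values may exist).
ALLOWED_SETTINGS = {
    "no",
    "yes",
    "with_positions",
    "with_offsets",
    "with_positions_offsets",
}


def term_vector_mapping(field, setting="with_positions_offsets"):
    """Build an index-creation body that enables term vectors on one text field."""
    if setting not in ALLOWED_SETTINGS:
        raise ValueError(f"unknown term_vector setting: {setting!r}")
    return {
        "mappings": {
            "properties": {
                field: {"type": "text", "term_vector": setting},
            }
        }
    }


body = term_vector_mapping("body")
# This JSON would be sent as the body of the index-creation request.
print(json.dumps(body, indent=2))
```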
Options:
| Setting | Stores |
|---|---|
| no | Nothing (default) |
| yes | Terms and frequencies only |
| with_positions | Terms, frequencies, and token positions |
| with_offsets | Terms, frequencies, and character offsets |
| with_positions_offsets | All of the above |
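At index time, the term vector for each document is written to dedicated term vector files in the Lucene segment (such as .tvd for the data and .tvx for the index into it; the exact set of files depends on the Lucene version). Retrieval is by document ID.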
In Elasticsearch, term vectors for a document can be retrieved via the _termvectors API:
GET /my-index/_termvectors/42?fields=body
This returns each term, its frequency, and positions/offsets if stored.
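A minimal sketch of reading such a response in Python follows. The `sample_response` dictionary is hand-written for the text "The quick brown fox" and trimmed to two terms; it mirrors the general layout of a `_termvectors` response (a `term_vectors` object keyed by field, with `term_freq` and `tokens` per term), and the helper `flatten_term_vector` is a hypothetical name.

```python
# Hand-written, trimmed sample; a real response carries more fields.
sample_response = {
    "_index": "my-index",
    "_id": "42",
    "found": True,
    "term_vectors": {
        "body": {
            "terms": {
                "brown": {
                    "term_freq": 1,
                    "tokens": [{"position": 2, "start_offset": 10, "end_offset": 15}],
                },
                "fox": {
                    "term_freq": 1,
                    "tokens": [{"position": 3, "start_offset": 16, "end_offset": 19}],
                },
            }
        }
    },
}


def flatten_term_vector(response, field):
    """Turn one field's term vector into a list of rows, sorted by term."""
    terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
    rows = []
    for term in sorted(terms):
        info = terms[term]
        tokens = info.get("tokens", [])  # absent if positions/offsets were not stored
        rows.append({
            "term": term,
            "freq": info["term_freq"],
            "positions": [t["position"] for t in tokens if "position" in t],
            "offsets": [
                (t["start_offset"], t["end_offset"])
                for t in tokens
                if "start_offset" in t
            ],
        })
    return rows


for row in flatten_term_vector(sample_response, "body"):
    print(row["term"], row["freq"], row["positions"], row["offsets"])
```

The `.get(...)` defaults mean a missing field or a vector without token details yields empty lists rather than an error.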
Illustration: the document "The quick brown fox" with its term vector:
| Term | Freq | Positions | Start offset | End offset |
|---|---|---|---|---|
| brown | 1 | [2] | 10 | 15 |
| fox | 1 | [3] | 16 | 19 |
| quick | 1 | [1] | 4 | 9 |
| the | 1 | [0] | 0 | 3 |
Example
Document D1: "the cat sat on the cat"
Term vector (with positions):
| Term | Freq | Positions |
|---|---|---|
| cat | 2 | [1, 5] |
| on | 1 | [3] |
| sat | 1 | [2] |
| the | 2 | [0, 4] |
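The table can be reproduced with a small Python sketch. It assumes a deliberately simple analysis chain (split on whitespace, then lowercase), which is a stand-in for a real analyzer; `build_term_vector` is a hypothetical helper.

```python
import re


def build_term_vector(text):
    """Return term -> {freq, positions, offsets} using whitespace tokens, lowercased."""
    vector = {}
    for position, match in enumerate(re.finditer(r"\S+", text)):
        term = match.group().lower()
        entry = vector.setdefault(term, {"freq": 0, "positions": [], "offsets": []})
        entry["freq"] += 1
        entry["positions"].append(position)
        # Character offsets refer to the original, un-analysed string.
        entry["offsets"].append((match.start(), match.end()))
    return dict(sorted(vector.items()))  # term vectors list terms in sorted order


tv = build_term_vector("the cat sat on the cat")
for term, entry in tv.items():
    print(term, entry["freq"], entry["positions"])
# cat 2 [1, 5]
# on 1 [3]
# sat 1 [2]
# the 2 [0, 4]
```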
This supports:
- Highlighting: character offsets pinpoint where to insert <em> tags in the original string.
- More Like This: high-frequency terms are extracted from the term vector to form a similarity query.
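Both uses can be sketched in Python under simplifying assumptions. The `highlight` helper assumes the offsets do not overlap, and `top_terms` ranks by raw in-document frequency only, whereas a real More Like This implementation also weighs how rare a term is across the index. Both function names are hypothetical.

```python
def highlight(text, offsets, open_tag="<em>", close_tag="</em>"):
    """Wrap each (start, end) character span of text in tags; spans must not overlap."""
    out = []
    cursor = 0
    for start, end in sorted(offsets):
        out.append(text[cursor:start])          # untouched text before the match
        out.append(open_tag + text[start:end] + close_tag)
        cursor = end
    out.append(text[cursor:])                   # remainder after the last match
    return "".join(out)


def top_terms(term_freqs, n):
    """Pick the n most frequent terms, breaking ties alphabetically."""
    ranked = sorted(term_freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


text = "the cat sat on the cat"
# Offsets of "cat" in the original string, as a term vector with offsets would store them.
print(highlight(text, [(4, 7), (19, 22)]))
# the <em>cat</em> sat on the <em>cat</em>
print(top_terms({"cat": 2, "on": 1, "sat": 1, "the": 2}, 2))
# ['cat', 'the']
```

Note that the frequency-only ranking surfaces "the", which is one reason real implementations filter or down-weight very common terms.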
Variants and history
Term vectors have been part of Lucene since early versions. They were the primary mechanism for highlighting before the introduction of the Unified Highlighter, which can take offsets directly from the postings lists (when the field is indexed with offsets) or re-analyse the text, and so does not require stored term vectors.
On-the-fly term vectors: Elasticsearch can compute term vectors at request time, even for fields not configured with stored term vectors, by re-analysing the field value. This is useful for debugging analysis without reindexing.
When to use it
Enable term vectors (with_positions_offsets) when:
- Using the Fast Vector Highlighter: Elasticsearch's FVH requires stored term vectors; it is faster than the plain highlighter for large fields.
- Implementing More Like This: the MLT query uses each document's term vector to pick its most representative terms, and stored term vectors spare it from re-analysing the text at query time.
- Debugging analysis: use _termvectors to verify what the index-time analyzer actually produced for a specific document.
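For the first two cases, the request bodies might look like the following sketch, built as Python dictionaries. The index name `my-index`, the document ID `42`, the field `body`, and the parameter values are illustrative assumptions; the helper names are hypothetical.

```python
import json


def fvh_search_body(field, query_text):
    """Search body that asks for the Fast Vector Highlighter on one field."""
    return {
        "query": {"match": {field: query_text}},
        "highlight": {"fields": {field: {"type": "fvh"}}},
    }


def mlt_search_body(field, index, doc_id, max_query_terms=12):
    """Search body that finds documents similar to an already indexed document."""
    return {
        "query": {
            "more_like_this": {
                "fields": [field],
                "like": [{"_index": index, "_id": doc_id}],
                "min_term_freq": 1,
                "max_query_terms": max_query_terms,
            }
        }
    }


print(json.dumps(fvh_search_body("body", "quick fox"), indent=2))
print(json.dumps(mlt_search_body("body", "my-index", "42"), indent=2))
```

The FVH request only succeeds if the field was mapped with with_positions_offsets, so a mapping change and a reindex usually have to come first.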
Avoid enabling term vectors on fields where they are not needed — they add non-trivial storage overhead (typically 30–50% of the inverted index size for with_positions_offsets).