Forward Index
What it is
A forward index is a data structure that maps document IDs to terms. For every document in a corpus, the forward index records which terms appear in it — and optionally their frequencies and positions. It is the complement of the inverted index, which maps terms to documents.
The forward index is the natural intermediate structure produced during document ingestion: you read a document, tokenise it, and accumulate a per-document term list. Building the inverted index is then a matter of inverting this mapping.
How it works
During indexing, each incoming document is processed through the analysis chain. The forward index accumulates one entry per document:
doc_id → [(term, frequency, [positions]), ...]
For example, given two documents:
- D1: "the cat sat on the mat"
- D2: "the dog sat on the log"
The forward index is:
D1 → [("cat", 1, [1]), ("mat", 1, [5]), ("on", 1, [3]), ("sat", 1, [2]), ("the", 2, [0, 4])]
D2 → [("dog", 1, [1]), ("log", 1, [5]), ("on", 1, [3]), ("sat", 1, [2]), ("the", 2, [0, 4])]
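The listing above can be reproduced with a few lines of Python. This is a minimal sketch rather than how any particular engine does it: the `tokenize` and `build_forward_index` names are hypothetical, and the analysis chain is assumed to be nothing more than lowercasing and whitespace splitting, with positions counted from 0.

```python
from collections import defaultdict


def tokenize(text):
    # Assumed analysis chain: lowercase, then split on whitespace only.
    return text.lower().split()


def build_forward_index(docs):
    """Map each doc_id to a term-sorted list of (term, frequency, positions)."""
    forward = {}
    for doc_id, text in docs.items():
        positions = defaultdict(list)
        for pos, term in enumerate(tokenize(text)):
            positions[term].append(pos)
        # Frequency is simply the number of recorded positions.
        forward[doc_id] = [
            (term, len(pos_list), pos_list)
            for term, pos_list in sorted(positions.items())
        ]
    return forward


docs = {
    "D1": "the cat sat on the mat",
    "D2": "the dog sat on the log",
}
forward_index = build_forward_index(docs)
for doc_id, entry in forward_index.items():
    print(doc_id, "→", entry)
```

Sorting terms alphabetically within each entry is a choice made here to match the listing; a real engine may order or encode the per-document term list differently.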
To build the inverted index, the engine iterates over all forward index entries and, for each term, collects the IDs of the documents that contain it.
[illustrate: two documents on the left mapping via arrows to their per-document term lists on the right — each entry showing term + frequency + positions — with an arrow labelled “invert” pointing to an inverted index column on the far right]
Example
A simplified forward index for a three-document corpus:
| Doc | Terms (with freq) |
|---|---|
| D1 | the(2), cat(1), sat(1), on(1), mat(1) |
| D2 | the(2), dog(1), sat(1), on(1), log(1) |
| D3 | the(1), cat(1), chased(1), dog(1) |
This can be inverted to produce an inverted index where "cat" → [D1, D3], "dog" → [D2, D3], etc.
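The inversion step for this table can be sketched as follows. The `invert` function and the dictionary layout are assumptions made for illustration; the table's term frequencies are carried along so that each posting is a `(doc_id, frequency)` pair.

```python
from collections import defaultdict


def invert(forward_index):
    """Turn doc_id -> {term: freq} into term -> [(doc_id, freq), ...]."""
    inverted = defaultdict(list)
    # Visit documents in ID order so each postings list comes out sorted.
    for doc_id in sorted(forward_index):
        for term, freq in forward_index[doc_id].items():
            inverted[term].append((doc_id, freq))
    return dict(inverted)


# The three-document corpus from the table above.
forward_index = {
    "D1": {"the": 2, "cat": 1, "sat": 1, "on": 1, "mat": 1},
    "D2": {"the": 2, "dog": 1, "sat": 1, "on": 1, "log": 1},
    "D3": {"the": 1, "cat": 1, "chased": 1, "dog": 1},
}
inverted_index = invert(forward_index)

print(inverted_index["cat"])  # [('D1', 1), ('D3', 1)]
print(inverted_index["dog"])  # [('D2', 1), ('D3', 1)]
```

Keeping postings sorted by document ID is a common convention because it makes list intersection cheap, but it is a design choice of this sketch, not something the table itself implies.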
Variants and history
In the early Lucene architecture, the forward index was not stored explicitly: only the inverted index was kept, and per-document term access required storing term vectors separately. Lucene 4.0 introduced DocValues, a column-oriented complement to the inverted index that enables efficient forward-direction access for sorting and faceting.
Modern search engines use various forward-index structures:
- Term Vectors — per-document stored term lists, used for highlighting and More Like This queries.
- DocValues — column-store for numeric, keyword, and geo fields; optimised for sorting and aggregation, not text.
- Stored fields — verbatim copies of the original field values, used to return source content in results.
None of these is a classical forward index in the IR textbook sense, but together they cover the same access patterns.
When to use it
You don’t build a forward index manually in Elasticsearch, Solr, or OpenSearch — the engine manages this internally. The concepts matter when:
- Debugging analysis: understanding what terms a document contains after analysis requires forward-index access; the `_termvectors` API in Elasticsearch exposes this.
- Building custom indexes: if writing your own search engine or indexing layer, the forward index is the natural first pass before inversion.
- Document reconstruction: stored fields provide the verbatim original; term vectors provide the analysed form.
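For the debugging case, the sketch below shows one way to shape a `_termvectors` request and to fold its response back into the `(term, frequency, [positions])` form used earlier. The helper names, the index name `my-index`, and the field name `body` are all hypothetical, and the sample response is hand-written and heavily abbreviated for illustration; it is not real server output. Nothing is sent over the network here.

```python
def termvectors_request(index, doc_id, field):
    """Build the path and JSON body for an Elasticsearch _termvectors call."""
    path = f"/{index}/_termvectors/{doc_id}"
    body = {
        "fields": [field],
        "positions": True,        # needed to recover token positions
        "offsets": False,
        "term_statistics": False,
        "field_statistics": False,
    }
    return path, body


def to_forward_entry(response, field):
    """Convert a _termvectors response into [(term, freq, [positions]), ...]."""
    terms = response["term_vectors"][field]["terms"]
    return [
        (term, info["term_freq"], [tok["position"] for tok in info.get("tokens", [])])
        for term, info in sorted(terms.items())
    ]


# Hand-written, abbreviated response for illustration only.
sample_response = {
    "found": True,
    "term_vectors": {
        "body": {
            "terms": {
                "cat": {"term_freq": 1, "tokens": [{"position": 1}]},
                "the": {"term_freq": 2, "tokens": [{"position": 0}, {"position": 4}]},
            }
        }
    },
}

path, body = termvectors_request("my-index", "1", "body")
entry = to_forward_entry(sample_response, "body")
print(path)
print(entry)
```

Whether positions are returned depends on the request options and on how the field is mapped, so `to_forward_entry` falls back to an empty position list when a term carries no `tokens` array.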