DocValues
What it is
DocValues is a column-store data structure used in Lucene (and therefore Elasticsearch, OpenSearch, and Solr) to enable document-centric field access — the inverse of the inverted index. Where the inverted index answers “which documents contain this term?”, DocValues answers “what is the value of field X for document Y?”
DocValues are essential for operations that require per-document field values across many documents: sorting search results by price, computing average rating in an aggregation, or faceting by category.
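The contrast between the two structures can be sketched with plain Python containers. This is a toy model only: the corpus, the `category` values, and the variable names are hypothetical, and real Lucene structures are compressed on-disk formats rather than dicts and lists.

```python
from collections import defaultdict

# Toy corpus (hypothetical data): doc ID -> category value.
docs = {0: "books", 1: "toys", 2: "books", 3: "games"}

# Inverted index: term -> list of doc IDs that contain the term.
inverted = defaultdict(list)
for doc_id, term in sorted(docs.items()):
    inverted[term].append(doc_id)

# DocValues-style column: position i holds the value for doc i.
column = [docs[doc_id] for doc_id in range(len(docs))]

# "Which documents contain this term?" is answered by the inverted index.
books_docs = inverted["books"]

# "What is the value of the field for doc 2?" is answered by the column.
doc2_value = column[2]

# Faceting by category walks the column once, one value per document.
facets = defaultdict(int)
for value in column:
    facets[value] += 1
```

Answering the faceting question from the inverted index alone would mean visiting every term's posting list, which is why per-document access gets its own structure.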
How it works
When a field has DocValues enabled, Lucene writes its values into a separate column-oriented file at index time. The values are stored in document ID order and compressed with techniques suited to the value type:
- Numeric — delta and bit-packing compression.
- Keyword/string — sorted set of unique values with per-doc ordinal references.
- Binary — raw bytes stored per doc.
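The numeric and keyword encodings listed above can be illustrated with a simplified sketch. It assumes integer values (prices in cents, a hypothetical choice), stores each value as an offset from the column minimum packed into a fixed bit width, and maps keywords to ordinals in a sorted dictionary. Lucene's actual codecs are considerably more elaborate; the function names here are illustrative, not Lucene APIs.

```python
def pack_numeric(values):
    """Store each value as an offset from the minimum, packed at a fixed bit width."""
    base = min(values)
    deltas = [v - base for v in values]
    bits = max(max(deltas).bit_length(), 1)  # bits needed for the largest offset
    packed = 0
    for i, delta in enumerate(deltas):
        packed |= delta << (i * bits)
    return base, bits, packed


def unpack_numeric(base, bits, packed, doc_id):
    """Random access: read the value for one document without decoding the rest."""
    mask = (1 << bits) - 1
    return base + ((packed >> (doc_id * bits)) & mask)


def encode_keywords(values):
    """Build a sorted dictionary of unique terms plus one ordinal per document."""
    dictionary = sorted(set(values))
    ordinal_of = {term: i for i, term in enumerate(dictionary)}
    return dictionary, [ordinal_of[v] for v in values]


prices_cents = [4999, 1250, 19900, 4999]
base, bits, packed = pack_numeric(prices_cents)
decoded = [unpack_numeric(base, bits, packed, d) for d in range(len(prices_cents))]

categories = ["toys", "books", "toys", "games"]
dictionary, ordinals = encode_keywords(categories)
```

Because the dictionary is sorted, comparing two ordinals gives the same result as comparing the underlying strings, so a sort on a keyword field can work on small integers.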
At query time, DocValues are memory-mapped from disk. The OS page cache keeps hot columns in RAM. This is more efficient than fielddata (the older Elasticsearch approach of loading field values into heap memory) because:
- Off-heap — doesn’t consume JVM heap; less GC pressure.
- Lazy loading — only columns needed for the current request are accessed.
- Columnar layout — sequential disk access when iterating all docs for a sort or aggregation.
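The memory-mapping idea can be tried out with Python's standard `mmap` module. This is a sketch under stated assumptions: the file name `price.col` and the fixed-width layout of one 8-byte double per document are invented for illustration and are not Lucene's file format.

```python
import mmap
import os
import struct
import tempfile

VALUE = struct.Struct("<d")  # assumed layout: one little-endian double per doc


def write_column(path, values):
    """Write the column to disk in doc ID order."""
    with open(path, "wb") as f:
        for v in values:
            f.write(VALUE.pack(v))


def read_value(mapped, doc_id):
    """Read one document's value directly from the mapped bytes."""
    return VALUE.unpack_from(mapped, doc_id * VALUE.size)[0]


tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "price.col")  # hypothetical file name
write_column(path, [49.99, 12.50, 199.00, 49.99])

with open(path, "rb") as f:
    # The mapping lives outside the interpreter's heap; the OS pages it in on demand.
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        doc2_price = read_value(mapped, 2)
        doc_count = len(mapped) // VALUE.size
        all_prices = [read_value(mapped, d) for d in range(doc_count)]
    finally:
        mapped.close()

os.remove(path)
os.rmdir(tmpdir)
```

Only the pages that are actually touched get loaded, which mirrors the lazy-loading point: a column that no request reads costs no memory.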
[illustrate: side-by-side of inverted index (term → doc list) on the left and DocValues column store (doc → value) on the right — showing how a sort operation scans the DocValues column for “price” across all docs sequentially, without touching the inverted index]
Example
Four documents, price field with DocValues:
| DocID | price |
|---|---|
| 0 | 49.99 |
| 1 | 12.50 |
| 2 | 199.00 |
| 3 | 49.99 |
A `sort: price asc` request reads this column sequentially and returns the documents in the order 1, 0, 3, 2. Documents 0 and 3 tie at 49.99 and are shown here in doc ID order.
An aggregation computing average price reads all four values → (49.99 + 12.50 + 199.00 + 49.99) / 4 = 77.87.
Both operations require zero interaction with the inverted index.
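Both results can be reproduced from the column alone. The sketch below treats the column as a Python list indexed by doc ID and assumes ties are broken by doc ID, which is consistent with the ordering shown in the example.

```python
# The price column from the table, indexed by doc ID.
price = [49.99, 12.50, 199.00, 49.99]

# sort: price asc, with ties broken by doc ID (an assumption for this sketch).
order = sorted(range(len(price)), key=lambda doc_id: (price[doc_id], doc_id))

# avg aggregation: one sequential pass over the column.
average = round(sum(price) / len(price), 2)

print(order)    # [1, 0, 3, 2]
print(average)  # 77.87
```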
Variants and history
DocValues were introduced in Lucene 4.0 (2012) as an alternative to uninverting fields into JVM heap memory at search time. That older approach (Lucene's FieldCache, exposed in Elasticsearch as fielddata) caused frequent out-of-memory errors on large indexes.
Types of DocValues in Lucene:
| Type | Used for |
|---|---|
| `NUMERIC` | Integers, longs, floats, doubles |
| `BINARY` | Per-doc byte arrays |
| `SORTED` | Low-cardinality keyword fields (single value per doc) |
| `SORTED_SET` | Multi-value keyword fields |
| `SORTED_NUMERIC` | Multi-value numeric fields |
In Elasticsearch, DocValues are enabled by default for all non-analysed fields (keyword, numeric, date, geo). They can be disabled to save disk space on fields that will never be sorted or aggregated.
When to use it
DocValues are automatic for most field types in modern Elasticsearch/OpenSearch, so they rarely need manual configuration. They matter in three situations:
- Disabling to save disk: `"doc_values": false` on keyword fields that are only used for filtering, never sorting or faceting.
- Understanding aggregation performance: slow aggregations often indicate a field is using fielddata (legacy, heap-based) instead of DocValues. Check via `GET _stats/fielddata`.
- Script-based sorting/aggregation: scripts accessing field values use DocValues under the hood via `doc['field']` syntax.
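As a rough illustration of the first and third cases, the request bodies might look like the following, written here as Python dicts that serialize to JSON. The field names `sku` and `price` and the term value are hypothetical, and the sketch only builds the bodies; it does not send them to a cluster, so check the exact shape against the documentation for the version in use.

```python
import json

# Mapping body: a keyword field used only for filtering, with DocValues disabled.
# "sku" and "price" are hypothetical field names.
mapping_body = {
    "mappings": {
        "properties": {
            "sku": {"type": "keyword", "doc_values": False},
            "price": {"type": "double"},  # DocValues stay on by default
        }
    }
}

# Search body: filter on the field without DocValues, then sort with a script
# that reads the price through doc['price'], which is backed by DocValues.
script_sort_body = {
    "query": {"term": {"sku": "A-100"}},
    "sort": {
        "_script": {
            "type": "number",
            "script": {"lang": "painless", "source": "doc['price'].value"},
            "order": "asc",
        }
    },
}

mapping_json = json.dumps(mapping_body, indent=2)
search_json = json.dumps(script_sort_body, indent=2)
```

Filtering on `sku` still works because term queries use the inverted index; sorting or aggregating on it would not, since there is no column to read.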