DocValues

What it is

DocValues is a column-store data structure used in Lucene (and therefore Elasticsearch, OpenSearch, and Solr) to enable document-centric field access — the inverse of the inverted index. Where the inverted index answers “which documents contain this term?”, DocValues answers “what is the value of field X for document Y?”

DocValues are essential for operations that require per-document field values across many documents: sorting search results by price, computing average rating in an aggregation, or faceting by category.

How it works

When a field has DocValues enabled, Lucene writes its values into a separate column-oriented file at index time. The data is sorted by document ID and compressed using techniques appropriate to the value type:

  • Numeric — delta and bit-packing compression.
  • Keyword/string — sorted set of unique values with per-doc ordinal references.
  • Binary — raw bytes stored per doc.
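
The encodings listed above can be sketched in a few lines of Python. This is a simplified illustration, not Lucene's actual on-disk format: the helper names (pack_numeric, read_numeric, encode_keywords) and the sample data are hypothetical, and "delta" is modelled here as an offset from the column minimum.

```python
def pack_numeric(values):
    """Subtract the column minimum, then pack each delta into a fixed bit width."""
    base = min(values)
    deltas = [v - base for v in values]
    bits = max(1, max(deltas).bit_length())  # bits needed for the largest delta
    packed = 0
    for doc_id, delta in enumerate(deltas):
        packed |= delta << (doc_id * bits)
    return base, bits, packed


def read_numeric(encoded, doc_id):
    """Random access by doc ID: shift, mask, and add the minimum back."""
    base, bits, packed = encoded
    mask = (1 << bits) - 1
    return base + ((packed >> (doc_id * bits)) & mask)


def encode_keywords(values):
    """Store each unique term once (sorted); each doc keeps only an ordinal."""
    dictionary = sorted(set(values))
    ordinal_of = {term: i for i, term in enumerate(dictionary)}
    ordinals = [ordinal_of[v] for v in values]
    return dictionary, ordinals


# Hypothetical sample data: prices in cents and one category per document.
prices_in_cents = [4999, 1250, 19900, 4999]
encoded = pack_numeric(prices_in_cents)

categories = ["books", "toys", "books", "games"]
dictionary, ordinals = encode_keywords(categories)
```

Because the dictionary is sorted, comparing two ordinals gives the same result as comparing the terms themselves, which is why sorting on a keyword column can work on small integers rather than strings.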

At query time, DocValues are memory-mapped from disk. The OS page cache keeps hot columns in RAM. This is more efficient than fielddata (the older Elasticsearch approach of loading field values into heap memory) because:

  • Off-heap — doesn’t consume JVM heap; less GC pressure.
  • Lazy loading — only columns needed for the current request are accessed.
  • Columnar layout — sequential disk access when iterating all docs for a sort or aggregation.
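
As a rough analogy for the memory-mapped access described above, the following Python sketch writes one column to a file and reads values back through mmap. The file name price.col and the fixed-width, uncompressed layout are assumptions made for illustration; Lucene's real format is compressed and more involved.

```python
import mmap
import os
import struct
import tempfile

VALUE_SIZE = 8  # assumed layout: one little-endian double per document


def write_column(path, values):
    """Write the column in doc-ID order, one fixed-width value per doc."""
    with open(path, "wb") as f:
        for v in values:
            f.write(struct.pack("<d", v))


def read_value(mapped, doc_id):
    """Look up one document's value by offset; only the touched pages are read."""
    (value,) = struct.unpack_from("<d", mapped, doc_id * VALUE_SIZE)
    return value


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "price.col")  # hypothetical column file
    write_column(path, [49.99, 12.50, 199.00, 49.99])
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        third = read_value(mapped, 2)
        # Iterating all docs walks the file front to back (sequential access).
        all_values = [read_value(mapped, d) for d in range(len(mapped) // VALUE_SIZE)]
```

The mapped bytes are served from the OS page cache rather than copied into the application up front, and a column that is never requested is never read.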

[illustrate: side-by-side of inverted index (term → doc list) on the left and DocValues column store (doc → value) on the right — showing how a sort operation scans the DocValues column for “price” across all docs sequentially, without touching the inverted index]

Example

Four documents, price field with DocValues:

DocID   price
0       49.99
1       12.50
2       199.00
3       49.99

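A request sorted by price in ascending order reads this column sequentially and returns the documents in the order D1, D0, D3, D2. D0 and D3 share the same price, so their relative order depends on the tie-breaker, shown here as document ID.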

An aggregation computing average price reads all four values → (49.99 + 12.50 + 199.00 + 49.99) / 4 = 77.87.

Both operations require zero interaction with the inverted index.
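
The two operations in this example can be reproduced with a plain Python list standing in for the price column. This is only a sketch of the access pattern; breaking ties by document ID is an assumption of the sketch.

```python
# The column from the table above: list index = doc ID, element = price.
price = [49.99, 12.50, 199.00, 49.99]

# Sort ascending by price; ties are broken by doc ID (an assumption here).
sorted_doc_ids = sorted(range(len(price)), key=lambda doc_id: (price[doc_id], doc_id))

# Average aggregation: a single pass over the same column.
average = sum(price) / len(price)

print(sorted_doc_ids)
print(round(average, 2))
```

Neither computation looks up a term; both only walk the column by document ID.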

Variants and history

DocValues were introduced in Lucene 4.0 (2012) as an on-disk alternative to the older approach of un-inverting the index into JVM heap memory at query time (FieldCache in Lucene, fielddata in Elasticsearch), which could cause out-of-memory errors on large indexes.

Types of DocValues in Lucene:

Type             Used for
NUMERIC          Single-value numeric fields (integers, longs, floats, doubles)
BINARY           Per-doc byte arrays
SORTED           Single-value keyword fields (sorted unique terms plus one ordinal per doc)
SORTED_SET       Multi-value keyword fields
SORTED_NUMERIC   Multi-value numeric fields
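
To make the two multi-value types concrete, here is a small Python sketch of their logical layout. The function names and sample data are hypothetical, and the structures only mimic what each type exposes per document, not how Lucene stores it.

```python
def encode_sorted_set(docs):
    """SORTED_SET-like layout: a shared sorted dictionary, plus a sorted,
    de-duplicated list of ordinals for each document."""
    dictionary = sorted({term for terms in docs for term in terms})
    ordinal_of = {term: i for i, term in enumerate(dictionary)}
    per_doc = [sorted({ordinal_of[t] for t in terms}) for terms in docs]
    return dictionary, per_doc


def encode_sorted_numeric(docs):
    """SORTED_NUMERIC-like layout: each document's numbers in sorted order,
    with duplicates kept."""
    return [sorted(values) for values in docs]


# Hypothetical multi-value fields; doc 2 has no values at all.
tags = [["red", "blue"], ["green"], [], ["blue", "blue", "red"]]
dictionary, tag_ordinals = encode_sorted_set(tags)

ratings = encode_sorted_numeric([[5, 3], [4], [], [2, 2, 5]])
```

Note the difference in how duplicates are handled: the repeated "blue" in doc 3 collapses to a single ordinal, while the repeated rating 2 is retained.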

In Elasticsearch, DocValues are enabled by default for fields that are not analysed (keyword, numeric, date, geo); analysed text fields do not support them. They can be disabled to save disk space on fields that will never be sorted or aggregated.

When to use it

DocValues are automatic for most field types in modern Elasticsearch/OpenSearch. They need attention in only a few cases:

  • Disabling to save disk: set "doc_values": false on keyword fields that are only used for filtering, never sorting or faceting.
  • Understanding aggregation performance: heavy heap use or slow aggregations can indicate that a field is using fielddata (legacy, heap-based) instead of DocValues. Check via GET _stats/fielddata.
  • Script-based sorting/aggregation: scripts accessing field values use DocValues under the hood via the doc['field'] syntax.
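
As a sketch of the first and third cases, the Python snippet below builds two request bodies as dictionaries and serialises them to JSON. The field names session_id and price, the term value, and the 1.2 multiplier are hypothetical; the bodies are meant for an index-creation request and a search request respectively, and should be checked against the documentation for the version in use.

```python
import json

# Index mapping: a filter-only keyword field with DocValues switched off.
# "session_id" and "price" are hypothetical field names.
mapping = {
    "mappings": {
        "properties": {
            "session_id": {"type": "keyword", "doc_values": False},
            "price": {"type": "double"},  # DocValues stay on by default
        }
    }
}

# Search body: filter on the keyword field, then sort by a script that
# reads the price column through the doc['field'] syntax.
search = {
    "query": {"term": {"session_id": "abc123"}},
    "sort": {
        "_script": {
            "type": "number",
            "script": {"lang": "painless", "source": "doc['price'].value * 1.2"},
            "order": "asc",
        }
    },
}

mapping_json = json.dumps(mapping, indent=2)
search_json = json.dumps(search, indent=2)
print(mapping_json)
print(search_json)
```

With this mapping, session_id can still be matched by the term query through the inverted index, but it can no longer be sorted on, aggregated, or read from a script.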

See also