Segment

What it is

A Lucene index is not a single monolithic file — it is a collection of segments, each of which is a complete, independent mini-index containing its own inverted index, stored fields, DocValues, and term dictionary. Segments are immutable: once written, they are never modified. New documents create new segments; deletions are tracked via a separate deletion bitmap rather than by modifying the segment.

Understanding segments is essential for reasoning about indexing throughput, search latency, and storage costs in Elasticsearch, OpenSearch, and Solr.

How it works

Writing new documents: Documents buffered in memory are periodically flushed to a new segment on disk. Each flush creates a small segment. In Lucene, the flush trigger is configured on IndexWriterConfig (setRAMBufferSizeMB or setMaxBufferedDocs); in Elasticsearch, new segments are created on each refresh, with the indexing buffer bounded by indices.memory.index_buffer_size (index.translog.flush_threshold_size, by contrast, controls when a full Lucene commit happens, not when segments are created).
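A minimal sketch of this buffering behaviour (not Lucene's implementation — just the flush-at-threshold pattern, with a made-up MAX_BUFFERED_DOCS constant playing the role of maxBufferedDocs):

```python
# Toy model: an in-memory buffer that flushes into a new immutable
# segment whenever it reaches the configured document threshold.
MAX_BUFFERED_DOCS = 3  # stands in for Lucene's maxBufferedDocs

segments = []   # each segment is a frozen tuple of docs, never modified
buffer = []

def index_doc(doc):
    buffer.append(doc)
    if len(buffer) >= MAX_BUFFERED_DOCS:
        segments.append(tuple(buffer))  # flush: write a new segment
        buffer.clear()

for i in range(10):
    index_doc(f"doc-{i}")

print(len(segments))  # 3 segments of 3 docs each; doc-9 still buffered
print(len(buffer))    # 1
```

Note that segments are created as immutable tuples: once flushed, a segment is only ever read or merged, never edited in place.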

Searching: A search scans all segments and merges results. More segments = more work per query. A single large segment is faster to search than 100 small ones with identical total content.
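The per-segment search cost can be sketched as follows (a toy model, not Lucene's query execution — each segment is just a dict of document texts):

```python
# Toy model: a search must visit every segment and merge the
# per-segment hits, so cost grows with segment count.
segments = [
    {"doc1": "lucene index", "doc2": "merge policy"},
    {"doc3": "lucene segment"},
    {"doc4": "inverted index", "doc5": "lucene nrt"},
]

def search(term):
    hits = []
    for seg in segments:  # one pass per segment
        for doc_id, text in seg.items():
            if term in text.split():
                hits.append(doc_id)
    return hits

print(search("lucene"))  # ['doc1', 'doc3', 'doc5']
```

Collapsing the three dicts into one would not change the result, only the number of segment-level passes — which is exactly why fewer, larger segments search faster.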

Deletions: Deleted documents are marked in a .liv (live documents) bitset. The segment itself is not modified. Deleted documents are hidden from search but still consume disk space until the segment is merged.
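The live-documents bitset can be sketched like this (a toy model of the .liv semantics, not the on-disk format):

```python
# Toy model: deletions only clear a bit in a live-docs list; the
# segment's document data is never rewritten.
segment_docs = ["alpha", "beta", "gamma", "delta"]
live = [True] * len(segment_docs)   # stands in for the .liv bitset

def delete(doc_index):
    live[doc_index] = False         # segment_docs itself is untouched

def visible_docs():
    return [d for d, alive in zip(segment_docs, live) if alive]

delete(1)
print(visible_docs())     # ['alpha', 'gamma', 'delta']
print(len(segment_docs))  # 4 — the deleted doc still occupies space
```

The deleted document disappears from search results immediately, but its bytes remain in the segment until a merge copies the live documents elsewhere.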

Merging: The merge policy (TieredMergePolicy by default) periodically merges small segments into larger ones. Merging reclaims deleted document space and reduces segment count, improving search performance. Merging is CPU and I/O intensive.
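A merge can be sketched as copying only the live documents of several small segments into one new segment (a toy model — TieredMergePolicy's actual selection heuristics are much more involved):

```python
# Toy model of a merge: live docs from several small segments are
# copied into one new segment; deleted docs are dropped for good.
segments = [
    {"docs": ["a", "b"], "deleted": {1}},   # "b" is deleted
    {"docs": ["c"], "deleted": set()},
    {"docs": ["d", "e"], "deleted": {0}},   # "d" is deleted
]

def merge(segs):
    merged = [d for seg in segs
                for i, d in enumerate(seg["docs"])
                if i not in seg["deleted"]]
    return {"docs": merged, "deleted": set()}

segments = [merge(segments)]
print(segments[0]["docs"])  # ['a', 'c', 'e'] — deleted space reclaimed
print(len(segments))        # 1 segment instead of 3
```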

[illustrate: timeline diagram — documents arrive and accumulate in memory buffer → flush creates small segments (S1, S2, S3) → merge policy combines them into a larger segment (S4) → another batch of small segments (S5, S6) → merge creates S7 — showing how segment count stays bounded over time]

Example

After indexing 10,000 documents with small buffer settings:

Index:
  segment_0: 1000 docs, 50 deleted
  segment_1: 2000 docs, 0 deleted
  segment_2: 3000 docs, 200 deleted
  segment_3: 4000 docs, 0 deleted

A search must open and scan all 4 segments. After a merge:

Index:
  segment_merged: 9750 live docs (250 deletions reclaimed)

One segment to scan — faster search, smaller index.
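The arithmetic from the example above can be checked directly:

```python
# Per-segment (total docs, deleted docs) from the example.
segments = [(1000, 50), (2000, 0), (3000, 200), (4000, 0)]

total_docs = sum(d for d, _ in segments)
total_deleted = sum(x for _, x in segments)
live_after_merge = total_docs - total_deleted

print(total_docs, total_deleted, live_after_merge)  # 10000 250 9750
```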

Variants and history

Lucene’s segment architecture was designed by Doug Cutting and has been central to Lucene since he created it in 1999. The design prioritises write throughput (immutable segments allow lock-free writing) at the cost of read-time segment count management.

Near real-time (NRT) search — in Lucene, a new segment can be made searchable without a full commit by refreshing the index reader. Elasticsearch’s refresh_interval (default 1s) controls how often new segments become visible. This is the mechanism behind Elasticsearch’s “near real-time” search claim.
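The visibility mechanics can be sketched with a toy model (not Lucene's DirectoryReader.openIfChanged machinery — just the idea that indexed documents become searchable only when a refresh publishes them):

```python
# Toy model of NRT visibility: newly indexed docs sit in a buffer and
# are not searchable until refresh() publishes them as a segment.
searchable_segments = []
buffer = []

def index(doc):
    buffer.append(doc)

def refresh():
    if buffer:
        searchable_segments.append(tuple(buffer))
        buffer.clear()

def search():
    return [d for seg in searchable_segments for d in seg]

index("doc-1")
print(search())   # [] — indexed but not yet visible
refresh()         # analogous to Elasticsearch's periodic refresh
print(search())   # ['doc-1']
```

The gap between index() and refresh() is exactly the window that refresh_interval controls: shorter intervals mean fresher results but more small segments to merge later.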

Force merge — Elasticsearch exposes POST /index/_forcemerge?max_num_segments=1 to collapse the index to a single segment. This maximises read performance but should only be used on read-only indices: force-merging an index that still receives writes can produce segments larger than the merge policy's normal ceiling, which automatic merging then skips, so deleted documents accumulate in them indefinitely.

When to use it

Segment awareness matters for:

  • Indexing performance — many small segments from bulk indexing should be followed by an explicit merge or by tuning refresh_interval to a longer window during bulk loads.
  • Search latency — if GET /_cat/segments shows hundreds of segments, a force merge or tuning the merge policy may reduce query latency.
  • Disk space — high delete ratios without merging waste disk. Monitor with _stats.

See also