Merge Policy

What it is

A merge policy is the algorithm that decides when Lucene should combine multiple small index segments into fewer, larger ones, and which segments to combine. Because a search must visit every segment and combine the per-segment results, accumulating too many segments degrades search performance. Merging reduces the segment count but consumes CPU and I/O, competing with indexing and search.

The merge policy is one of the key tuning levers for Lucene-based search engines.

How it works

After each flush creates a new segment (and after each completed merge), IndexWriter asks the configured merge policy whether any merges are now needed. The merge scheduler then runs the selected merges, by default on background threads. The most common policy is TieredMergePolicy (the Lucene default):

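TieredMergePolicy organises segments into tiers by size. It computes a budget of how many segments the index may hold, allowing roughly segmentsPerTier (default 10) segments in each tier. When the index exceeds that budget, the policy scores candidate merges of up to maxMergeAtOnce (default 10) segments and picks the best one: it favours merging segments of similar size, smaller merges, and merges that reclaim more space from deleted documents. Each tier is roughly maxMergeAtOnce times larger than the one below it, so with the defaults tier n holds segments of about floorSegmentMB × 10^n MB, up to the maxMergedSegmentMB cap.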

Key parameters:

Parameter            Default        Effect
maxMergeAtOnce       10             Maximum number of segments combined in one merge
segmentsPerTier      10             Segments allowed per tier before a merge is selected
maxMergedSegmentMB   5120 (5 GB)    Approximate cap on the size of a merged segment
floorSegmentMB       2              Segments smaller than this are treated as this size during merge selection
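
To make the interaction of these parameters concrete, the following Python sketch estimates how many segments a tiered policy would tolerate for a given index. It is a simplified model of the tier idea described above, not Lucene's actual code: the function name allowed_segment_count, the use of megabytes throughout, and the omission of deleted documents and oversized segments are all assumptions made for illustration.

```python
import math


def allowed_segment_count(segment_sizes_mb, segments_per_tier=10,
                          max_merge_at_once=10, floor_segment_mb=2,
                          max_merged_segment_mb=5120):
    """Estimate the segment budget of a tiered policy (simplified model)."""
    if not segment_sizes_mb:
        return 0
    # The lowest tier starts at the smallest segment, rounded up to the floor.
    level_size = max(min(segment_sizes_mb), floor_segment_mb)
    size_left = sum(segment_sizes_mb)
    allowed = 0
    while True:
        count_at_level = size_left / level_size
        # Stop when the remaining data fits in this tier or the cap is reached.
        if count_at_level < segments_per_tier or level_size >= max_merged_segment_mb:
            allowed += math.ceil(count_at_level)
            return allowed
        # Fill this tier, then move up to the next, larger tier.
        allowed += segments_per_tier
        size_left -= segments_per_tier * level_size
        level_size = min(max_merged_segment_mb, level_size * max_merge_at_once)


def needs_merge(segment_sizes_mb, **params):
    """True when the index holds more segments than the budget allows."""
    return len(segment_sizes_mb) > allowed_segment_count(segment_sizes_mb, **params)
```

Under this model, 80 segments of 50 MB have a budget of 17 segments, so merging is due, while 8 segments of 500 MB have a budget of 8 and are left alone.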

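LogByteSizeMergePolicy is an older policy that groups segments into levels by the logarithm of their size in bytes and merges mergeFactor (default 10) adjacent segments once a level holds that many. It is simpler than TieredMergePolicy but usually less efficient, because it can only merge adjacent segments.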
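
The level-based approach can be illustrated with a short Python sketch. It is a deliberately simplified model, not the Lucene implementation: the helper names size_level and pick_level_merges are hypothetical, sizes are in megabytes, and the real policy defines its levels more loosely than the strict powers of mergeFactor used here.

```python
def size_level(size_mb, merge_factor=10):
    """Return floor(log_mergeFactor(size)), computed by repeated division."""
    level = 0
    while size_mb >= merge_factor:
        size_mb /= merge_factor
        level += 1
    return level


def pick_level_merges(segment_sizes_mb, merge_factor=10):
    """Select runs of merge_factor adjacent segments that share a level."""
    merges = []
    run = []  # indexes of the current run of adjacent same-level segments
    for index, size in enumerate(segment_sizes_mb):
        level = size_level(size, merge_factor)
        # A change of level ends the current run, since only adjacent
        # segments on the same level are merged together.
        if run and size_level(segment_sizes_mb[run[0]], merge_factor) != level:
            run = []
        run.append(index)
        if len(run) == merge_factor:
            merges.append(run)
            run = []
    return merges
```

For three 500 MB segments followed by twelve 50 MB segments, this sketch selects one merge of the first ten 50 MB segments (indexes 3 to 12) and leaves the rest untouched.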

[illustrate: tiered diagram — segments arranged in rows by size tier (S < 2MB, 2-20MB, 20-200MB) — merge arrows showing small segments in the lowest tier being combined into a medium segment, medium segments combining into large — with segment count decreasing from left to right as merges complete]

Example

After heavy bulk indexing, the index has 80 segments of about 50 MB each. TieredMergePolicy runs merges in the background, combining groups of 10 into larger segments. Once one round of 10-way merges has completed:

Before: 80 segments × 50 MB = 4 GB, 80 segments visited per search
After:   8 segments × 500 MB = 4 GB, 8 segments visited per search

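Search latency, especially for aggregations, can drop substantially when the segment count falls from 80 to 8. The often-quoted 5–10× range is a rough rule of thumb; the actual gain depends on query type, caching, and hardware.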
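
The arithmetic of this example can be checked with a few lines of Python. This is only a sketch of the bookkeeping: merge_round is a hypothetical helper, sizes are in megabytes, and the space reclaimed from deleted documents is ignored, so the total size stays constant.

```python
def merge_round(segment_sizes_mb, group_size=10):
    """Combine consecutive groups of group_size segments into one each."""
    merged = []
    for start in range(0, len(segment_sizes_mb), group_size):
        group = segment_sizes_mb[start:start + group_size]
        merged.append(sum(group))  # merged size is the sum of its inputs
    return merged


before = [50] * 80
after = merge_round(before)

print(len(before), sum(before))  # 80 4000
print(len(after), sum(after))    # 8 4000
```

The total data volume is unchanged at about 4 GB; only the number of segments each search has to visit shrinks.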

Variants and history

Lucene has shipped several merge policies:

  • TieredMergePolicy (current default, since Lucene 3.2): tier-based; can merge non-adjacent segments and favours merges that reclaim deleted documents.
  • LogByteSizeMergePolicy: groups segments into levels by byte size; simpler, still supported.
  • LogDocMergePolicy: the same level-based algorithm as LogByteSizeMergePolicy, but it measures segment size by document count instead of bytes.
  • NoMergePolicy: disables merging entirely (useful for read-only indexes built offline).

Elasticsearch exposes merge policy settings under index.merge.policy.*. The most commonly tuned settings are segments_per_tier (lower values merge more aggressively) and max_merged_segment (lower values cap the size, and therefore the I/O cost, of the largest merges).
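
As an illustration, the Python sketch below builds a JSON body that could be sent to an index's _settings endpoint to change these two settings. The helper name merge_policy_settings and the values shown are hypothetical, and the sketch assumes both settings can be updated on a live index in your Elasticsearch version; it only constructs the body and sends nothing.

```python
import json


def merge_policy_settings(segments_per_tier, max_merged_segment):
    """Build a settings body for the two merge policy settings named above."""
    if segments_per_tier < 2:
        raise ValueError("segments_per_tier should be at least 2")
    return {
        "index.merge.policy.segments_per_tier": segments_per_tier,
        "index.merge.policy.max_merged_segment": max_merged_segment,
    }


# Illustrative values only: merge more aggressively, cap merged segments at 2 GB.
body = merge_policy_settings(segments_per_tier=5, max_merged_segment="2gb")
print(json.dumps(body, indent=2))
```

Keeping the body construction separate from the HTTP call makes it easy to review the exact settings before applying them with whichever client you already use.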

When to use it

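  • Bulk indexing: reduce merge pressure during a large initial load (for example by raising segments_per_tier or limiting merge concurrency with index.merge.scheduler.max_thread_count), then force-merge at the end. Lowering index.merge.scheduler.max_merge_count does not switch merging off; it only bounds how many merges may be pending before indexing threads are stalled.
  • Read-heavy indexes: force-merge to a small number of segments (POST /index/_forcemerge?max_num_segments=1) to minimise search latency. Recommended only for indexes that will not receive further writes, because later updates leave a very large segment whose deleted documents normal merging rarely reclaims.
  • Write-heavy indexes: increase segments_per_tier to reduce merge frequency, accepting more segments (and slightly slower search) in exchange for higher write throughput.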
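
The force-merge call mentioned in the list above can be assembled programmatically. The following Python sketch only builds the request line; the helper name force_merge_request and the index name logs-2024 are hypothetical, and actually sending the request is left to your HTTP or Elasticsearch client.

```python
from urllib.parse import quote, urlencode


def force_merge_request(index_name, max_num_segments=1):
    """Return the (method, path) pair for a force-merge of one index."""
    if max_num_segments < 1:
        raise ValueError("max_num_segments must be at least 1")
    query = urlencode({"max_num_segments": max_num_segments})
    # Percent-encode the index name so it is safe to place in a URL path.
    return "POST", f"/{quote(index_name, safe='')}/_forcemerge?{query}"


method, path = force_merge_request("logs-2024")
print(method, path)  # POST /logs-2024/_forcemerge?max_num_segments=1
```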

See also