Commit

Indexing Information-Retrieval Needs-Review

What it is

A commit is the operation that makes recently indexed documents durable: persisted to disk in a way that survives a crash or restart. In Lucene, a commit fsyncs the current segment files and writes a new commit point (a segments_N file) that lists them. Elasticsearch calls the same operation a flush; after the Lucene commit it also clears the transaction log (translog), because committed operations no longer need to be replayed.

Commit must be distinguished from refresh. A refresh makes documents visible to search (near real-time) but does not fsync the new segments, so on its own it gives no durability guarantee. A commit makes the segments durable but is more expensive.

How it works

Lucene’s write path has three stages:

Buffer: documents accumulate in an in-memory index buffer and are not yet visible to search.
Refresh: the buffer is written out as a new, searchable segment that lives in the OS page cache. Documents are now visible to search, but the segment has not yet been fsync'd to durable storage. In Elasticsearch, index.refresh_interval defaults to 1 second.
Commit (flush): all segments not yet synced are fsync'd to disk, a new segments_N commit point file is written listing all current segments, and the transaction log is cleared. From this point the segments themselves survive a crash; until then, Elasticsearch relies on the translog to recover recent operations.
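
The three stages can be sketched as a toy in-memory model. This is only an illustration: ToyShard and its methods are hypothetical names, not Lucene or Elasticsearch APIs, and the model assumes the translog itself has been synced to disk so that it survives the simulated crash.

```python
class ToyShard:
    """Toy model of the buffer -> refresh -> commit write path (not a real API)."""

    def __init__(self):
        self.buffer = []      # stage 1: in memory, not searchable
        self.page_cache = []  # stage 2: searchable segments, not yet fsync'd
        self.disk = []        # stage 3: segments listed in the last commit point
        self.translog = []    # operations recorded since the last commit

    def index(self, doc):
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        if self.buffer:
            self.page_cache.append(list(self.buffer))  # new searchable segment
            self.buffer.clear()

    def commit(self):
        self.refresh()                     # write out anything still buffered
        self.disk.extend(self.page_cache)  # "fsync" segments, write commit point
        self.page_cache.clear()
        self.translog.clear()              # committed operations need no replay

    def search(self):
        return [doc for seg in self.disk + self.page_cache for doc in seg]

    def crash_and_recover(self):
        self.buffer.clear()
        self.page_cache.clear()            # unsynced segments are lost
        replay = list(self.translog)       # assumed durable, so it survives
        self.translog.clear()
        for doc in replay:                 # replay re-records each operation
            self.index(doc)
        self.refresh()


shard = ToyShard()
shard.index("a")
print(shard.search())      # "a" is only buffered at this point
shard.commit()
shard.index("b")
shard.refresh()
shard.crash_and_recover()
print(shard.search())      # "b" comes back via translog replay
```

The design point the sketch tries to show is that search visibility (refresh) and crash survival (commit plus translog) are tracked separately.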

Elasticsearch exposes the last two stages as separate operations, flush and refresh. (The names hard commit and soft commit come from Solr, which uses them for the same two ideas.)

Flush (hard commit): Performs a Lucene commit (fsync plus a new commit point) and clears the translog. Triggered by POST /index/_flush, or automatically, for example when the translog reaches index.translog.flush_threshold_size (default 512 MB).

Refresh (soft commit): Makes segments searchable without syncing them to disk. Triggered by POST /index/_refresh, by the refresh parameter on index requests, or periodically according to index.refresh_interval.

[illustrate: timeline showing three stages — documents enter memory buffer (stage 1) → refresh creates searchable segment in page cache (stage 2, documents appear in search) → flush writes fsync’d commit point and clears translog (stage 3, documents survive crash) — with crash recovery arrow showing that data between last flush and crash is replayed from the translog]

Example

Indexing workflow for a bulk load:

Disable automatic refresh (refresh_interval: -1) to avoid creating many small segments.
Bulk index all documents.
Restore refresh_interval to its normal value (for example, 1s) and call POST /index/_refresh to make all documents searchable.
Call POST /index/_flush to make them durable and clear the translog.
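
The workflow above can be sketched as an ordered list of REST requests. This is a minimal sketch that only builds the requests and does not send them; bulk_load_plan and the index name "books" are hypothetical, and the step that restores refresh_interval after the load is an assumption added here so the index does not stay with refresh disabled.

```python
import json


def bulk_load_plan(index, docs, restore_interval="1s"):
    """Return ordered (method, path, body) tuples for a bulk load."""
    # _bulk takes newline-delimited JSON: an action line, then the document.
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps(doc))
    bulk_body = "\n".join(lines) + "\n"  # the body must end with a newline

    def settings(interval):
        return json.dumps({"index": {"refresh_interval": interval}})

    return [
        ("PUT", f"/{index}/_settings", settings("-1")),              # disable refresh
        ("POST", f"/{index}/_bulk", bulk_body),                      # load documents
        ("PUT", f"/{index}/_settings", settings(restore_interval)),  # assumed restore step
        ("POST", f"/{index}/_refresh", None),                        # make searchable
        ("POST", f"/{index}/_flush", None),                          # make durable
    ]


for method, path, _body in bulk_load_plan("books", [{"title": "Dune"}]):
    print(method, path)
```

Each tuple can be handed to whatever HTTP client is in use; for real loads the documents would normally be split across several _bulk requests rather than sent as one body.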

This pattern maximises indexing throughput by avoiding per-document or per-second refresh overhead during the load.

Variants and history

Elasticsearch's translog follows the write-ahead logging (WAL) pattern that is standard in databases. Lucene itself makes changes durable only at a commit, so Elasticsearch also records each operation in the translog: if a node crashes before the next flush, the translog is replayed to recover the operations that were not yet committed.

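How often the translog itself is fsync'd is configurable via index.translog.durability. With the default, request, Elasticsearch fsyncs the translog before acknowledging each index, delete, update, or bulk request. With async, the translog is fsync'd in the background every index.translog.sync_interval (default 5s), which improves write throughput at the cost of possibly losing the writes acknowledged since the last sync if the node crashes.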
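
The trade-off between the two index.translog.durability values, request and async, can be illustrated with a small timing model. This is a sketch under simplifying assumptions: unsynced_writes is a hypothetical helper, timestamps are in seconds, the interval defaults to 5 seconds, and background fsyncs are assumed to land exactly on multiples of the interval, which real scheduling does not guarantee.

```python
import math


def unsynced_writes(write_times, crash_time, durability="request", sync_interval=5.0):
    """Return acknowledged writes whose translog entries were not yet fsync'd."""
    acked = [t for t in write_times if t <= crash_time]
    if durability == "request":
        return []  # every write is fsync'd before it is acknowledged
    if durability == "async":
        # Assumption: the last background fsync happened on the most recent
        # multiple of sync_interval before the crash.
        last_sync = math.floor(crash_time / sync_interval) * sync_interval
        return [t for t in acked if t > last_sync]
    raise ValueError(f"unknown durability: {durability!r}")


writes = [1, 4, 6, 9]
print(unsynced_writes(writes, crash_time=9.5, durability="request"))
print(unsynced_writes(writes, crash_time=9.5, durability="async"))
```

In this model the request mode never loses an acknowledged write, while the async mode can lose anything written after the last background sync, so the exposure is bounded by the sync interval.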

When to use it

Explicit flush after bulk indexing: after a large bulk load, call _flush to commit the new segments and clear the translog.
Tuning refresh_interval: increase to 30s or set to -1 during heavy indexing; reduce to 1s (or less) for near real-time search requirements.
Monitoring translog size: a large translog means a lot of uncommitted data and a longer replay on recovery. Check via GET /index/_stats/translog, which reports the total translog size and the uncommitted portion.
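
One way to monitor translog size is to read the translog section of the index stats API (GET /index/_stats/translog). The sketch below works on an already-parsed response; the helper names are hypothetical, the response shape and the uncommitted_size_in_bytes field are assumed from the stats API as commonly documented, and the sample values are made up for illustration.

```python
def uncommitted_translog_bytes(stats, index):
    """Read the uncommitted translog size for one index from a stats response."""
    translog = stats["indices"][index]["primaries"]["translog"]
    return translog["uncommitted_size_in_bytes"]


def needs_flush(stats, index, threshold_bytes=512 * 1024 * 1024):
    """True once the uncommitted translog reaches the threshold (512 MB here)."""
    return uncommitted_translog_bytes(stats, index) >= threshold_bytes


# Abbreviated, made-up response used only to exercise the helpers.
sample = {
    "indices": {
        "books": {
            "primaries": {
                "translog": {
                    "operations": 120000,
                    "size_in_bytes": 640 * 1024 * 1024,
                    "uncommitted_operations": 110000,
                    "uncommitted_size_in_bytes": 600 * 1024 * 1024,
                }
            }
        }
    }
}

print(uncommitted_translog_bytes(sample, "books"))
print(needs_flush(sample, "books"))
```

The default threshold mirrors index.translog.flush_threshold_size, so a True result suggests that an automatic flush is overdue or in progress.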