Chunking Strategy
What it is
Chunking strategy defines how long documents are split into smaller passages for indexing and retrieval. Document chunks become the units indexed in RAG systems. Strategy choices affect retrieval quality, context preservation, and computational cost. Common strategies: fixed-size windows (e.g., 100 tokens), sentence-based, paragraph-based, or semantic chunking.
[illustrate: Long document split different ways: fixed chunks, sentence-level, semantic chunks; retrieval results for each showing tradeoffs]
How it works
Fixed-size chunks:
- Split at token or byte boundaries (e.g., 512 tokens)
- Overlap windows to preserve context across boundaries
- Fast, simple, but semantically arbitrary
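A minimal sketch of fixed-size chunking with overlap; whitespace splitting stands in for a real tokenizer, and the chunk_size and overlap defaults are illustrative:

```python
def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows of roughly chunk_size tokens."""
    assert overlap < chunk_size, "overlap must be smaller than chunk_size"
    tokens = text.split()  # stand-in for a real tokenizer
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):  # last window reached the end
            break
    return chunks
```

Each window starts chunk_size - overlap tokens after the previous one, so the last overlap tokens of one chunk reappear at the start of the next.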
Structural chunking:
- Respect document structure: paragraphs, sections, headings
- Preserves natural discourse units
- Requires parsing; may create variable-size chunks
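A sketch of paragraph-based structural chunking that packs short paragraphs together up to a token budget; max_tokens is an illustrative parameter:

```python
import re

def paragraph_chunks(text: str, max_tokens: int = 512) -> list[str]:
    """Split on blank lines, then pack paragraphs into chunks up to max_tokens."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        n = len(para.split())  # whitespace token count as a stand-in
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))  # close the current chunk
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Note that a single paragraph longer than max_tokens still passes through whole; handling that case is one motivation for the hybrid approach below.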
Semantic chunking:
- Group sentences by semantic similarity, typically computed with sentence embeddings
- Chunks respect content coherence
- Computationally expensive; requires embedding all text
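A toy sketch of the boundary-detection idea: walk through sentences and start a new chunk whenever similarity to the previous sentence drops below a threshold. Word-overlap (Jaccard) similarity stands in here for real sentence embeddings with cosine similarity, and the threshold value is illustrative; word overlap misses much of the topical similarity that embeddings capture.

```python
def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[str]:
    """Group consecutive sentences while adjacent similarity stays high."""
    def sim(a: str, b: str) -> float:
        # Jaccard word overlap; a real system would compare embedding vectors.
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for sent in sentences[1:]:
        if sim(current[-1], sent) >= threshold:
            current.append(sent)              # same topic: extend the chunk
        else:
            chunks.append(" ".join(current))  # similarity drop: start a new chunk
            current = [sent]
    chunks.append(" ".join(current))
    return chunks
```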
Hybrid:
- Combine strategies: paragraph chunks with semantic refinement
- Balance efficiency and coherence
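A sketch of one such hybrid, reusing the fixed_size_chunks and paragraph_chunks sketches above: chunk by paragraph first, then re-split any paragraph that exceeds the budget at fixed size:

```python
def hybrid_chunks(text: str, max_tokens: int = 512, overlap: int = 50) -> list[str]:
    """Paragraph chunks, with oversize paragraphs re-split at fixed size."""
    chunks: list[str] = []
    for chunk in paragraph_chunks(text, max_tokens):
        if len(chunk.split()) > max_tokens:  # a single paragraph blew the budget
            chunks.extend(fixed_size_chunks(chunk, max_tokens, overlap))
        else:
            chunks.append(chunk)
    return chunks
```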
Example
```
# Document: "Alice went to the store. She bought apples and oranges..."

Fixed-size (50 tokens):
Chunk 1: "Alice went to the store. She bought apples..."
Chunk 2: "...and oranges. She paid $10..."

Sentence-based:
Chunk 1: "Alice went to the store."
Chunk 2: "She bought apples and oranges."
Chunk 3: "She paid $10."

Semantic:
Chunk 1: "Alice went to the store. She bought apples and oranges."
(Same topic: shopping)
Chunk 2: "She paid $10."
(Payment)
```
Variants and history
Document segmentation long predates RAG, with roots in classical information retrieval. TREC-style chunking (fixed size with overlap) remains standard for retrieval evaluation. LangChain popularized recursive chunking (splitting along a hierarchy of separators). Semantic chunking gained interest in the RAG era (2023+). Topic segmentation (identifying topic boundaries in text) is related but distinct. The optimal strategy is task- and domain-dependent; there is no universal best practice.
When to use it
Choose a chunking strategy considering:
- Document length: Long documents need smaller chunks so retrieved passages stay on-topic
- Granularity needed: Fine-grained retrieval needs smaller chunks; coarse retrieval works with larger ones
- Overlap trade-off: Overlap preserves context across boundaries but increases index size and retrieval cost
- Semantic coherence: Semantic chunking preserves meaning but is expensive
- Context trade-off: Smaller chunks enable precise retrieval but give the LLM less surrounding context
Standard practice: 256–1024 token chunks with 50–100 tokens of overlap. Tune based on your domain and retrieval-quality experiments; a common starting point is sketched below.
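As a concrete starting point, a sketch using LangChain's recursive splitter (the package name and import path vary across versions). Note that chunk_size here counts characters by default, not tokens; roughly 4 characters per token is a common rule of thumb for English text.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,    # ~500 tokens at ~4 chars/token
    chunk_overlap=300,  # ~75 tokens of overlap
)
document = "..."  # your document text
chunks = splitter.split_text(document)
```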