Chunking Strategy
What it is
Chunking strategy defines how long documents are split into smaller passages for indexing and retrieval. Document chunks become the units indexed in RAG systems. Strategy choices affect retrieval quality, context preservation, and computational cost. Common strategies: fixed-size windows (e.g., 100 tokens), sentence-based, paragraph-based, or semantic chunking.
[illustrate: Long document split different ways: fixed chunks, sentence-level, semantic chunks; retrieval results for each showing tradeoffs]
How it works
Fixed-size chunks:
- Split at token or byte boundaries (e.g., 512 tokens)
- Overlap windows to preserve context across boundaries
- Fast, simple, but semantically arbitrary
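A minimal sketch of fixed-size chunking with overlap; whitespace splitting stands in for a real tokenizer, and the chunk_size and overlap defaults are illustrative:

```python
def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows of roughly chunk_size tokens."""
    assert overlap < chunk_size, "overlap must be smaller than chunk_size"
    tokens = text.split()  # stand-in for a real tokenizer
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):  # last window reached the end
            break
    return chunks
```

Each window starts chunk_size - overlap tokens after the previous one, so the last overlap tokens of one chunk reappear at the start of the next.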
Structural chunking:
- Respect document structure: paragraphs, sections, headings
- Preserves natural discourse units
- Requires parsing; may create variable-size chunks
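A sketch of paragraph-based structural chunking that packs short paragraphs together up to a token budget; max_tokens is an illustrative parameter:

```python
import re

def paragraph_chunks(text: str, max_tokens: int = 512) -> list[str]:
    """Split on blank lines, then pack paragraphs into chunks up to max_tokens."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        n = len(para.split())  # whitespace token count as a stand-in
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))  # close the current chunk
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Note that a single paragraph longer than max_tokens still passes through whole; handling that case is one motivation for the hybrid approach below.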
Semantic chunking:
- Group sentences by semantic similarity, typically computed with sentence embeddings
- Chunks respect content coherence
- Computationally expensive; requires embedding all text
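A toy sketch of the boundary-detection idea: walk through sentences and start a new chunk whenever similarity to the previous sentence drops below a threshold. Word-overlap (Jaccard) similarity stands in here for real sentence embeddings with cosine similarity, and the threshold value is illustrative; word overlap misses much of the topical similarity that embeddings capture.

```python
def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[str]:
    """Group consecutive sentences while adjacent similarity stays high."""
    def sim(a: str, b: str) -> float:
        # Jaccard word overlap; a real system would compare embedding vectors.
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for sent in sentences[1:]:
        if sim(current[-1], sent) >= threshold:
            current.append(sent)              # same topic: extend the chunk
        else:
            chunks.append(" ".join(current))  # similarity drop: start a new chunk
            current = [sent]
    chunks.append(" ".join(current))
    return chunks
```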
Hybrid:
- Combine strategies: paragraph chunks with semantic refinement
- Balance efficiency and coherence
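A sketch of one such hybrid, reusing the fixed_size_chunks and paragraph_chunks sketches above: chunk by paragraph first, then re-split any paragraph that exceeds the budget at fixed size:

```python
def hybrid_chunks(text: str, max_tokens: int = 512, overlap: int = 50) -> list[str]:
    """Paragraph chunks, with oversize paragraphs re-split at fixed size."""
    chunks: list[str] = []
    for chunk in paragraph_chunks(text, max_tokens):
        if len(chunk.split()) > max_tokens:  # a single paragraph blew the budget
            chunks.extend(fixed_size_chunks(chunk, max_tokens, overlap))
        else:
            chunks.append(chunk)
    return chunks
```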
Example
```
# Document: "Alice went to the store. She bought apples and oranges..."

Fixed-size (50 tokens):
Chunk 1: "Alice went to the store. She bought apples..."
Chunk 2: "...and oranges. She paid $10..."

Sentence-based:
Chunk 1: "Alice went to the store."
Chunk 2: "She bought apples and oranges."
Chunk 3: "She paid $10."

Semantic:
Chunk 1: "Alice went to the store. She bought apples and oranges."
(Same topic: shopping)
Chunk 2: "She paid $10."
(Payment)
```
Variants and history
Document segmentation long predates RAG, with roots in classical information retrieval. TREC-style chunking (fixed size with overlap) remains standard for retrieval evaluation. LangChain popularized recursive chunking (splitting along a hierarchy of separators). Semantic chunking gained interest in the RAG era (2023+). Topic segmentation (identifying topic boundaries in text) is related but distinct. The optimal strategy is task- and domain-dependent; there is no universal best practice.
When to use it
Choose a chunking strategy considering:
- Document length: Long documents need smaller chunks so retrieved passages stay on-topic
- Granularity needed: Fine-grained retrieval needs smaller chunks; coarse retrieval works with larger ones
- Overlap trade-off: Overlap preserves context across boundaries but increases index size and retrieval cost
- Semantic coherence: Semantic chunking preserves meaning but is expensive
- Context trade-off: Smaller chunks enable precise retrieval but give the LLM less surrounding context
Standard practice: 256–1024 token chunks with 50–100 tokens of overlap. Tune based on your domain and retrieval-quality experiments; a common starting point is sketched below.
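As a concrete starting point, a sketch using LangChain's recursive splitter (the package name and import path vary across versions). Note that chunk_size here counts characters by default, not tokens; roughly 4 characters per token is a common rule of thumb for English text.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,    # ~500 tokens at ~4 chars/token
    chunk_overlap=300,  # ~75 tokens of overlap
)
document = "..."  # your document text
chunks = splitter.split_text(document)
```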