Context Window

What it is

The context window (also called context length or sequence length) is the maximum number of tokens a language model can process in a single forward pass. Early transformer models such as BERT have a 512-token context, while modern LLMs range from 2k to 128k tokens. Longer context windows let models incorporate more relevant information but increase computational cost (O(n^2) for full attention).

[illustrate: Transformer with context window showing maximum sequence length; examples of text that fit vs. exceed context window]
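As a concrete sketch of a fixed window forcing truncation, the snippet below uses the Hugging Face transformers tokenizer; the model name and the placeholder text are illustrative assumptions, not part of the original example.

# Minimal sketch, assuming the Hugging Face transformers library is installed
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # 512-token model

text = "long article text " * 400  # stand-in for a document longer than 512 tokens

# Token count without truncation: well above 512
print(len(tokenizer(text)["input_ids"]))

# Truncate to the model's maximum context length
encoded = tokenizer(text, truncation=True, max_length=512)
print(len(encoded["input_ids"]))  # 512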

How it works

  1. Architecture constraint: Absolute positional encodings fix a hard maximum length; relative or rotary position embeddings relax this, but models are still trained up to a bounded length

  2. Attention complexity: Full self-attention is O(n^2) in sequence length (see the sketch after this list)

    • 512 tokens: ~260k attention operations
    • 4k tokens: ~16M attention operations
    • 128k tokens: ~16B attention operations
  3. Inference cost: Processing longer contexts takes more time and memory (attention compute grows quadratically with length; KV-cache memory grows linearly)

  4. Extrapolation:

    • Most models don’t extrapolate well beyond their training context length
    • Positional interpolation or rotary embeddings (RoPE) help extend length (sketched below)
    • Sparse attention (local windows, learned patterns) reduces complexity
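
A back-of-the-envelope check of the quadratic scaling in step 2, contrasted with a local sliding window; the 512-token window width is an illustrative assumption.

# Full self-attention scores every token against every other token: n^2 pairs.
# A local sliding window only scores each token against w neighbours: n * w pairs.
WINDOW = 512  # assumed local-attention window width

for n in (512, 4_096, 128_000):
    full = n * n
    local = n * WINDOW
    print(f"{n:>7} tokens: full {full:>14,}  windowed {local:>12,}")

# Roughly (full-attention counts match step 2: ~260k, ~16M, ~16B):
#     512 tokens: full        262,144   windowed      262,144
#   4,096 tokens: full     16,777,216   windowed    2,097,152
# 128,000 tokens: full 16,384,000,000   windowed   65,536,000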
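
A simplified sketch of positional interpolation with rotary embeddings, as mentioned in step 4; the dimensions, base, and lengths are illustrative, and the actual attention computation is omitted.

import numpy as np

def rope_angles(position, dim=64, base=10_000.0):
    # Rotary-embedding rotation angles for a single position (simplified)
    inv_freq = 1.0 / base ** (np.arange(0, dim, 2) / dim)
    return position * inv_freq

train_len, target_len = 4_096, 16_384
scale = train_len / target_len  # positional interpolation factor (0.25)

pos = 10_000                              # position beyond the training length
out_of_range = rope_angles(pos)           # angles the model never saw in training
interpolated = rope_angles(pos * scale)   # rescaled back into the trained range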

Example

# BERT context window: 512 tokens
Input: [long article 1000 tokens]
Truncate or split into:
  Chunk 1: [first 512 tokens]
  Chunk 2: [next 488 tokens]

# GPT-4 context window: 128k tokens
Input: [book 100k tokens]
Fits entirely in context; no truncation needed

# Implication for RAG:
Context window = space for retrieved documents + query + prompt
128k window allows ~100–200 document snippets
4k window allows ~5–10 snippets
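
A rough budget calculation along the lines of the RAG note above; the prompt, query, answer, and snippet sizes are illustrative assumptions.

def max_snippets(context_window, prompt_tokens, query_tokens,
                 answer_tokens, tokens_per_snippet):
    # Tokens left for retrieved documents after prompt, query, and answer budget
    budget = context_window - prompt_tokens - query_tokens - answer_tokens
    return max(budget // tokens_per_snippet, 0)

# Assumed: 200-token prompt, 50-token query, 500 tokens reserved for the answer,
# ~600 tokens per retrieved snippet
print(max_snippets(4_000, 200, 50, 500, 600))    # 5 snippets
print(max_snippets(128_000, 200, 50, 500, 600))  # 212 snippets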

Variants and history

Early Transformers (Vaswani et al., 2017) and BERT used 512-token contexts, largely a computational constraint. GPT-3 increased this to 2k tokens. GPT-4 variants support up to 128k tokens. PaLM and Llama 2 range from 2k to 4k. Sparse attention (Child et al., 2019) reduced attention complexity. Long-context models (Longformer, BigBird, 2020) handle 4k–16k tokens efficiently. Rotary embeddings (Su et al., 2021) improve length extrapolation. The trend is toward ever-longer context windows as models scale.

When to use it

Consider context window when:

  • Processing long documents (books, scientific papers)
  • Building RAG systems (need space for retrieved docs + query)
  • Dialogue systems (history length limits)
  • Few-shot learning (examples + test case must fit)
  • Choosing between models (GPT-3.5 vs. GPT-4 window size)

Longer context is useful but expensive. A common hybrid approach is to retrieve only the relevant context (RAG) rather than processing the entire corpus on every call.

See also