Causal Language Model

What it is

Causal Language Modeling (CLM) is an autoregressive pre-training objective where the model predicts the next token given all previous tokens. CLM is the foundation of generative models like GPT. The “causal” aspect means the model can only attend to past (and current) tokens, not future ones, enforcing a unidirectional generation order.

[illustrate: Token sequence with causal mask preventing attention to future tokens; generation unfolding left-to-right; next-token prediction at each step]

How it works

  1. Causal attention mask:

    • Position i attends only to positions ≤ i (see the mask sketch after this list)
    • Prevents information flow from future tokens
  2. Training objective:

    • Predict t_i from t_1, …, t_{i-1}
    • Compute cross-entropy loss for each position
    • Sum losses across all positions
  3. Inference:

    • Seed with prompt or [START] token
    • Predict next token given previous tokens
    • Sample or greedy-select next token; repeat until [END] or max length
  4. Properties: Unidirectional; natural for generation; enables efficient token-by-token sampling
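
As an illustration of the mask in step 1, here is a minimal sketch (assuming PyTorch; the sequence length and attention scores are toy values, not a real model):

```python
import torch

seq_len = 5
# Lower-triangular boolean mask: position i may attend only to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

# Toy attention scores; future positions are set to -inf before the softmax,
# so they receive zero attention weight.
scores = torch.randn(seq_len, seq_len)
scores = scores.masked_fill(~causal_mask, float("-inf"))
attn_weights = torch.softmax(scores, dim=-1)  # each row sums to 1 over positions <= i
```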

Example

Training on: "The quick brown fox jumps"

Position 0: Predict "quick" from ["The"]
Position 1: Predict "brown" from ["The", "quick"]
Position 2: Predict "fox" from ["The", "quick", "brown"]
Position 3: Predict "jumps" from ["The", "quick", "brown", "fox"]

Loss = -log P(quick|The) - log P(brown|The,quick) - ...
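
A sketch of how this summed loss is typically computed in code (assuming PyTorch; the token IDs and the tiny stand-in model are hypothetical, not a real tokenizer or Transformer):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size = 1000
# Stand-in "model": random embeddings plus a linear head giving next-token logits
# per position (a real CLM would be a Transformer decoder using the causal mask above).
embed = nn.Embedding(vocab_size, 32)
to_logits = nn.Linear(32, vocab_size)

# Hypothetical token IDs for "The quick brown fox jumps".
tokens = torch.tensor([12, 873, 441, 905, 77])

logits = to_logits(embed(tokens))   # shape (5, vocab_size): one prediction per position

# Position i predicts token i+1: drop the last prediction and the first target.
pred = logits[:-1]                  # predictions made from "The" ... "fox"
targets = tokens[1:]                # "quick", "brown", "fox", "jumps"

# Cross-entropy summed across positions, matching the loss formula above.
loss = F.cross_entropy(pred, targets, reduction="sum")
```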

Inference (generation):
Prompt: "The"
Step 1: Sample "quick" from P(token|The)
Step 2: Sample "brown" from P(token|The, quick)
... (repeat until [END] token or max length)
Output: "The quick brown fox..."
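
A minimal sketch of this generation loop (assuming PyTorch; `model` stands for any callable mapping a token-ID sequence to per-position logits, and `end_id` for a hypothetical [END] token ID):

```python
import torch

def generate(model, prompt_ids, end_id, max_new_tokens=20, greedy=True):
    """Autoregressive decoding: score the sequence, take the last position's
    logits, pick the next token, append it, and repeat."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(torch.tensor(ids))       # shape (len(ids), vocab_size)
        next_logits = logits[-1]                # distribution over the next token
        if greedy:
            next_id = int(next_logits.argmax())
        else:
            probs = torch.softmax(next_logits, dim=-1)
            next_id = int(torch.multinomial(probs, 1))
        ids.append(next_id)
        if next_id == end_id:                   # stop at the [END] token
            break
    return ids
```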

Variants and history

Autoregressive language modeling dates back to n-gram models. LSTM language models (2010s) improved sequential prediction. The Transformer (Vaswani et al., 2017), whose decoder attends through a causal mask, enabled GPT and its successors. The GPT series (OpenAI, 2018–) scaled CLM training from billions to trillions of tokens. Common decoding strategies include top-k sampling (restrict sampling to the k most probable tokens), temperature scaling (rescale logits to control randomness), and nucleus sampling (dynamic top-p: sample from the smallest set of tokens whose cumulative probability exceeds p).
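
A sketch of those three decoding tweaks applied to a vector of next-token logits (assuming PyTorch; the function name, thresholds, and defaults are illustrative):

```python
import torch

def sample_next(logits, temperature=1.0, top_k=None, top_p=None):
    """Temperature, top-k, and nucleus (top-p) sampling over next-token logits."""
    logits = logits / temperature                        # <1 sharpens, >1 flattens
    if top_k is not None:
        kth_best = torch.topk(logits, top_k).values[-1]  # k-th largest logit
        logits = logits.masked_fill(logits < kth_best, float("-inf"))
    if top_p is not None:
        sorted_logits, order = torch.sort(logits, descending=True)
        probs = torch.softmax(sorted_logits, dim=-1)
        # Drop tokens whose preceding cumulative mass already exceeds top_p,
        # keeping the smallest high-probability set that covers top_p.
        drop = probs.cumsum(dim=-1) - probs > top_p
        sorted_logits[drop] = float("-inf")
        logits = torch.full_like(logits, float("-inf")).scatter(0, order, sorted_logits)
    probs = torch.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, 1))
```

For instance, sample_next(next_logits, temperature=0.8, top_p=0.9) would combine mild sharpening with nucleus sampling.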

When to use it

Use CLM for:

  • Text generation (continuation, summarization, translation)
  • Language model pre-training
  • Generative models and open-ended tasks
  • Token prediction and next-word inference
  • Learning unidirectional representations

CLM naturally fits generation but can be weaker for pure understanding tasks, where bidirectional context helps. Modern systems often use both: CLM for generation, MLM or other objectives for understanding.

See also