Language Model

What it is

A language model assigns a probability to a sequence of tokens: P(t_1, t_2, …, t_n). Equivalently (via the chain rule), it predicts the next token given the preceding context: P(t_i | t_1, …, t_{i-1}). Language models are fundamental to NLP and are used for generation, evaluation, classification, and retrieval tasks.

[illustrate: Token sequence with probability distribution over next token shown as bar chart; high probability for sensible continuations, low for gibberish]

How it works

Language models estimate conditional token probabilities using:

  1. Counting-based (n-grams): estimate P(t_i | t_{i-n+1}, …, t_{i-1}) from corpus counts (see the counting sketch after this list)
  2. Neural (RNN, Transformer): Learn representations that predict next token
  3. Chain-rule factorization (to score whole sequences): P(t_1, …, t_n) = ∏_i P(t_i | t_1, …, t_{i-1})
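
A minimal counting-based sketch of items 1 and 3, assuming a toy one-sentence corpus, a bigram context, and add-one smoothing; the names (p_next, sequence_logprob) are illustrative, not a standard API:

# Bigram language model estimated from corpus counts
import math
from collections import Counter

corpus = "the quick brown fox jumps over the lazy dog".split()

unigrams = Counter(corpus)                    # single-token counts
bigrams = Counter(zip(corpus, corpus[1:]))    # adjacent-pair counts

def p_next(word, prev, alpha=1.0):
    """P(word | prev) with add-alpha smoothing over the observed vocabulary."""
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * len(unigrams))

def sequence_logprob(tokens):
    """log P(sequence) via the chain rule: unigram for t_1, bigram for the rest."""
    # assumes the first token was seen in the corpus
    logp = math.log(unigrams[tokens[0]] / sum(unigrams.values()))
    for prev, word in zip(tokens, tokens[1:]):
        logp += math.log(p_next(word, prev))
    return logp

print(sequence_logprob("the quick brown fox".split()))   # higher (less negative) = more probable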

Training uses:

  • Autoregressive (left-to-right): predict t_i from the preceding tokens t_1, …, t_{i-1}
  • Masked (bidirectional): predict a masked-out token from its surrounding context (target construction for both is sketched after this list)
  • Denoising: reconstruct from corrupted input
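
A small sketch of how the first two objectives differ in the targets they construct, assuming whitespace tokens, a BERT-style [MASK] symbol, and the conventional ~15% masking rate (all illustrative choices):

# Target construction for autoregressive vs. masked training
import random

tokens = ["the", "quick", "brown", "fox", "jumps"]

# Autoregressive: at each position the model sees tokens[:i] and must predict tokens[i]
autoregressive_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
# e.g. (["the"], "quick"), (["the", "quick"], "brown"), ...

# Masked: randomly hide some tokens; the model predicts the originals from both sides
random.seed(1)
masked_input, targets = [], {}
for i, tok in enumerate(tokens):
    if random.random() < 0.15:
        masked_input.append("[MASK]")
        targets[i] = tok          # remember what was hidden at position i
    else:
        masked_input.append(tok)

print(autoregressive_pairs)
print(masked_input, targets)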

Example

# N-gram language model (trigram)
P("machine learning course") ≈ P("machine") × P("learning"|"machine") × P("course"|"machine learning")

# Neural language model
feed ["The", "quick", "brown"] to transformer
output distribution over next token:
  P("fox") = 0.45
  P("dog") = 0.08
  P("rabbit") = 0.05
  ... (probabilities for all vocab)
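
A runnable version of the neural example, assuming the Hugging Face transformers library and the public "gpt2" checkpoint are available (the actual probabilities will differ from the illustrative numbers above):

# Next-token distribution from a pretrained causal language model
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The quick brown", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits              # shape: (1, sequence_length, vocab_size)

next_token_probs = torch.softmax(logits[0, -1], dim=-1)   # distribution over the next token
top = torch.topk(next_token_probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"P({tokenizer.decode(idx.item())!r}) = {p.item():.3f}")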

Variants and history

N-gram models date back to early statistical speech recognition in the 1970s. Neural language models (Bengio et al., 2003) used feedforward networks over fixed-size contexts. RNN and LSTM language models (2010s) handled longer-range dependencies. Transformer language models (2017+) scaled to billions of parameters; BERT (masked) and GPT (autoregressive) became foundational. Modern LLMs (GPT-4, PaLM, Llama) are language models scaled to trillions of training tokens and billions (or hundreds of billions) of parameters.

When to use it

Use language models for:

  • Text generation and continuation
  • Ranking or scoring candidate text (see the scoring sketch after this list)
  • Transfer learning via pre-training
  • Evaluation of text quality
  • Retrieval-augmented generation (grounding LLM outputs in retrieved text)
  • Language understanding tasks
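
A sketch of the ranking/scoring use, again assuming the Hugging Face transformers library and the "gpt2" checkpoint; the score is the average next-token cross-entropy, so lower means the model finds the text more probable:

# Rank candidate sentences by language-model score
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def avg_nll(text):
    """Average negative log-likelihood per token under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean next-token cross-entropy
        return model(ids, labels=ids).loss.item()

candidates = [
    "The quick brown fox jumps over the lazy dog.",
    "Dog lazy the over jumps fox brown quick the.",
]
for text in sorted(candidates, key=avg_nll):
    print(f"{avg_nll(text):.2f}  {text}")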

Language model quality strongly influences downstream task performance: larger models (more parameters, more training data) generally outperform smaller ones.

See also