Language Model
What it is
A language model assigns probability to sequences of tokens: P(t_1, t_2, …, t_n). Equivalently, it predicts the next token given previous context: P(t_i | t_1, …, t_{i-1}). Language models are fundamental to NLP, used for generation, evaluation, classification, and retrieval tasks.
[illustrate: Token sequence with probability distribution over next token shown as bar chart; high probability for sensible continuations, low for gibberish]
How it works
Language models estimate conditional token probabilities using:
- Counting-based (n-grams): P(t_i | t_{i-n}…t_{i-1}) from corpus counts
- Neural (RNN, Transformer): Learn representations that predict next token
- Factorization: P(sequence) = ∏ P(t_i | context_i)
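The chain-rule factorization above can be sketched with a toy bigram model. The probability table here is invented purely for illustration, not estimated from any corpus:

```python
import math

# Toy bigram model P(next | previous); probabilities are illustrative assumptions.
bigram = {
    ("<s>", "machine"): 0.2,
    ("machine", "learning"): 0.6,
    ("learning", "course"): 0.3,
}

def sequence_log_prob(tokens, model):
    """Chain rule: log P(sequence) = sum of log P(t_i | t_{i-1})."""
    logp = 0.0
    prev = "<s>"  # start-of-sequence marker
    for tok in tokens:
        logp += math.log(model[(prev, tok)])
        prev = tok
    return logp

p = math.exp(sequence_log_prob(["machine", "learning", "course"], bigram))
# p = 0.2 * 0.6 * 0.3 = 0.036
```

Working in log space, as here, avoids numerical underflow when sequences grow long and per-token probabilities multiply toward zero.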
Training uses:
- Autoregressive (left-to-right): predict t_i from preceding tokens
- Masked (bidirectional): predict masked token from surrounding context
- Denoising: reconstruct from corrupted input
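The autoregressive and masked objectives differ mainly in how training examples are constructed from raw text. A minimal sketch over a toy sentence (the sentence and `[MASK]` token choice are illustrative):

```python
tokens = ["the", "quick", "brown", "fox"]

# Autoregressive: every prefix predicts the next token (left-to-right).
ar_examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
# e.g. (["the"], "quick"), (["the", "quick"], "brown"), ...

# Masked: hide one token, predict it from context on BOTH sides.
def masked_example(tokens, i, mask="[MASK]"):
    corrupted = tokens[:i] + [mask] + tokens[i + 1:]
    return corrupted, tokens[i]

corrupted, target = masked_example(tokens, 2)
# corrupted = ["the", "quick", "[MASK]", "fox"], target = "brown"
```

The autoregressive form is what generation requires (you can only condition on the past), while the masked form gives richer context for understanding tasks but cannot generate left-to-right directly.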
Example
# N-gram language model (trigram)
P("machine learning course") ≈ P("machine") × P("learning"|"machine") × P("course"|"machine learning")
# Neural language model
feed ["The", "quick", "brown"] to transformer
output distribution over next token:
P("fox") = 0.45
P("dog") = 0.08
P("rabbit") = 0.05
... (probabilities for all vocab)
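The counting-based estimation behind the n-gram example can be made runnable with a bigram model over a tiny corpus. The corpus here is a made-up eight-word string, so the resulting probabilities are illustrative only:

```python
from collections import Counter, defaultdict

# Tiny illustrative corpus standing in for real training data.
corpus = "the quick brown fox the quick red fox".split()

# Count which tokens follow each token to estimate P(next | prev).
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def p_next(prev, nxt):
    """Maximum-likelihood estimate: count(prev, nxt) / count(prev, *)."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total

p_next("quick", "brown")  # "quick" is followed by "brown" once and "red" once, so 1/2
```

Real n-gram models add smoothing (e.g. add-one or Kneser-Ney) so that unseen continuations do not get probability zero.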
Variants and history
N-gram models date to the 1970s. Neural language models (Bengio et al., 2003) used feedforward networks. LSTM language models (2010s) handled longer dependencies. Transformer language models (2017+) scaled to billions of parameters. BERT (masked) and GPT (autoregressive) became foundational. Modern LLMs (GPT-4, PaLM, Llama) are language models scaled to trillions of training tokens and billions of parameters.
When to use it
Use language models for:
- Text generation and continuation
- Ranking or scoring text
- Transfer learning via pre-training
- Evaluation of text quality
- Grounding LLM outputs in retrieval
- Language understanding tasks
Language model quality strongly influences downstream task performance: larger models (more parameters, trained on more data) generally outperform smaller ones across these uses.
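Scoring and evaluation with a language model are commonly reported as perplexity, the exponential of the average negative log-likelihood per token. A minimal sketch, using hypothetical per-token probabilities rather than output from a real model:

```python
import math

def perplexity(log_probs):
    """Perplexity = exp(-mean log-likelihood); lower means the model
    finds the text more predictable."""
    return math.exp(-sum(log_probs) / len(log_probs))

# Hypothetical per-token log-probabilities a model might assign.
fluent = [math.log(0.4), math.log(0.5), math.log(0.3)]
gibberish = [math.log(0.01), math.log(0.02), math.log(0.005)]

perplexity(fluent)     # ~2.55: low, text looks plausible to the model
perplexity(gibberish)  # 100.0: high, text looks implausible
```

Perplexity is the inverse of the geometric mean of the per-token probabilities, which is why the gibberish example, whose probabilities have geometric mean 0.01, scores exactly 100.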