Language Model

What it is

A language model assigns a probability to a sequence of tokens: P(t_1, t_2, …, t_n). Equivalently (via the chain rule), it predicts the next token given the preceding context: P(t_i | t_1, …, t_{i-1}). Language models are fundamental to NLP and are used for generation, evaluation, classification, and retrieval tasks.

[illustrate: Token sequence with probability distribution over next token shown as bar chart; high probability for sensible continuations, low for gibberish]

How it works

Language models estimate conditional token probabilities using:

  1. Counting-based (n-grams): estimate P(t_i | t_{i-n+1}, …, t_{i-1}) from corpus counts (see the counting sketch after this list)
  2. Neural (RNN, Transformer): Learn representations that predict next token
  3. Chain-rule factorization (to score whole sequences): P(t_1, …, t_n) = ∏_i P(t_i | t_1, …, t_{i-1})
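
A minimal counting-based sketch of items 1 and 3, assuming a toy one-sentence corpus, a bigram context, and add-one smoothing; the names (p_next, sequence_logprob) are illustrative, not a standard API:

# Bigram language model estimated from corpus counts
import math
from collections import Counter

corpus = "the quick brown fox jumps over the lazy dog".split()

unigrams = Counter(corpus)                    # single-token counts
bigrams = Counter(zip(corpus, corpus[1:]))    # adjacent-pair counts

def p_next(word, prev, alpha=1.0):
    """P(word | prev) with add-alpha smoothing over the observed vocabulary."""
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * len(unigrams))

def sequence_logprob(tokens):
    """log P(sequence) via the chain rule: unigram for t_1, bigram for the rest."""
    # assumes the first token was seen in the corpus
    logp = math.log(unigrams[tokens[0]] / sum(unigrams.values()))
    for prev, word in zip(tokens, tokens[1:]):
        logp += math.log(p_next(word, prev))
    return logp

print(sequence_logprob("the quick brown fox".split()))   # higher (less negative) = more probable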

Training uses:

  • Autoregressive (left-to-right): predict t_i from the preceding tokens t_1, …, t_{i-1}
  • Masked (bidirectional): predict a masked-out token from its surrounding context (target construction for both is sketched after this list)
  • Denoising: reconstruct from corrupted input
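
A small sketch of how the first two objectives differ in the targets they construct, assuming whitespace tokens, a BERT-style [MASK] symbol, and the conventional ~15% masking rate (all illustrative choices):

# Target construction for autoregressive vs. masked training
import random

tokens = ["the", "quick", "brown", "fox", "jumps"]

# Autoregressive: at each position the model sees tokens[:i] and must predict tokens[i]
autoregressive_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
# e.g. (["the"], "quick"), (["the", "quick"], "brown"), ...

# Masked: randomly hide some tokens; the model predicts the originals from both sides
random.seed(1)
masked_input, targets = [], {}
for i, tok in enumerate(tokens):
    if random.random() < 0.15:
        masked_input.append("[MASK]")
        targets[i] = tok          # remember what was hidden at position i
    else:
        masked_input.append(tok)

print(autoregressive_pairs)
print(masked_input, targets)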

Example

# N-gram language model (trigram)
P("machine learning course") ≈ P("machine") × P("learning"|"machine") × P("course"|"machine learning")

# Neural language model
feed ["The", "quick", "brown"] to transformer
output distribution over next token:
  P("fox") = 0.45
  P("dog") = 0.08
  P("rabbit") = 0.05
  ... (probabilities for all vocab)
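
A runnable version of the neural example, assuming the Hugging Face transformers library and the public "gpt2" checkpoint are available (the actual probabilities will differ from the illustrative numbers above):

# Next-token distribution from a pretrained causal language model
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The quick brown", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits              # shape: (1, sequence_length, vocab_size)

next_token_probs = torch.softmax(logits[0, -1], dim=-1)   # distribution over the next token
top = torch.topk(next_token_probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"P({tokenizer.decode(idx.item())!r}) = {p.item():.3f}")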

Variants and history

N-gram models date back to early statistical speech recognition in the 1970s. Neural language models (Bengio et al., 2003) used feedforward networks over fixed-size contexts. RNN and LSTM language models (2010s) handled longer-range dependencies. Transformer language models (2017+) scaled to billions of parameters; BERT (masked) and GPT (autoregressive) became foundational. Modern LLMs (GPT-4, PaLM, Llama) are language models scaled to trillions of training tokens and billions (or hundreds of billions) of parameters.

When to use it

Use language models for:

  • Text generation and continuation
  • Ranking or scoring candidate text (see the scoring sketch after this list)
  • Transfer learning via pre-training
  • Evaluation of text quality
  • Retrieval-augmented generation (grounding LLM outputs in retrieved text)
  • Language understanding tasks
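
A sketch of the ranking/scoring use, again assuming the Hugging Face transformers library and the "gpt2" checkpoint; the score is the average next-token cross-entropy, so lower means the model finds the text more probable:

# Rank candidate sentences by language-model score
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def avg_nll(text):
    """Average negative log-likelihood per token under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean next-token cross-entropy
        return model(ids, labels=ids).loss.item()

candidates = [
    "The quick brown fox jumps over the lazy dog.",
    "Dog lazy the over jumps fox brown quick the.",
]
for text in sorted(candidates, key=avg_nll):
    print(f"{avg_nll(text):.2f}  {text}")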

Language model quality strongly influences downstream task performance: larger models (more parameters, more training data) generally outperform smaller ones.

See also