GPT
What it is
GPT (Generative Pre-trained Transformer) is a series of large autoregressive language models developed by OpenAI, beginning in 2018. Where BERT uses bidirectional attention and targets understanding tasks, GPT uses causal (left-to-right) attention, which makes it naturally suited to text generation. GPT models are pre-trained at massive scale on diverse data, then fine-tuned or prompted for downstream tasks.
[illustrate: GPT decoder architecture with causal attention (can attend to past tokens only); generation process showing token-by-token sampling; scaling curve showing performance vs. model size]
How it works
Architecture:
- Decoder-only transformer
- Causal self-attention: each position can attend only to earlier positions (see the mask sketch below)
- 12 layers (GPT-1) up to 96 layers (GPT-3); GPT-4's architecture is not public
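A minimal PyTorch sketch of the causal mask inside a single attention head (the tensor names and the 5-token, 8-dimensional toy sizes are illustrative only, not taken from any GPT release):

import torch
import torch.nn.functional as F

def causal_self_attention(x, w_q, w_k, w_v):
    # x: (T, d) token representations; returns (T, d) attended representations
    T, d = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v                  # project to queries, keys, values
    scores = (q @ k.T) / d ** 0.5                         # (T, T) attention logits
    future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))   # block attention to future positions
    return F.softmax(scores, dim=-1) @ v                  # each output mixes only past/current values

x = torch.randn(5, 8)                                     # toy input: 5 tokens, 8-dim embeddings
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
out = causal_self_attention(x, w_q, w_k, w_v)             # out[t] depends only on x[: t + 1]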
Pre-training:
- Objective: Predict the next token given all previous tokens (see the loss sketch below)
- Data: Diverse web text, filtered and deduplicated
- Scaling: Larger models on more data generally perform better
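The objective in code form: shift the target sequence by one position and score each position's logits against the token that actually follows. A sketch with random stand-in logits; in practice they come from the decoder stack above.

import torch
import torch.nn.functional as F

vocab_size, T = 50_000, 6
tokens = torch.randint(0, vocab_size, (T,))         # a tokenized training sequence
logits = torch.randn(T, vocab_size)                  # stand-in for the model's per-position logits

# Position t predicts token t + 1: drop the last logit row and the first token.
loss = F.cross_entropy(logits[:-1], tokens[1:])      # mean next-token negative log-likelihood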
Inference:
- Autoregressive generation: feed the tokens produced so far, sample the next token, repeat
- Temperature and top-k sampling control output diversity (see the sampling sketch below)
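A sketch of the generation loop with temperature and top-k filtering; here `model` is assumed to be any callable mapping a 1-D tensor of token ids to per-position next-token logits (an assumption for illustration, not a specific library API):

import torch

def generate(model, prompt_ids, max_new_tokens=20, temperature=0.8, top_k=40):
    # Autoregressive sampling: append one sampled token per step.
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(torch.tensor(ids))[-1]            # logits for the next token
        logits = logits / temperature                     # < 1 sharpens, > 1 flattens the distribution
        kth_best = torch.topk(logits, top_k).values[-1]
        logits[logits < kth_best] = float("-inf")         # keep only the top-k candidates
        probs = torch.softmax(logits, dim=-1)
        ids.append(torch.multinomial(probs, 1).item())    # sample the next token
    return ids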
Adaptation: Fine-tuning or in-context learning (prompt examples)
Example
# Pre-training:
Input: "The quick brown fox"
Predict: "jumps" (and continue)
# Generation (sampling):
Prompt: "Once upon a time"
Model generates: "...there was a young girl"
Each step: sample next token from P(next | all previous)
# Few-shot learning:
"Translate English to French:
dog → chien
cat → chat
house → "
Model: "maison"
Variants and history
GPT-1 (2018) showed that generative pre-training followed by task-specific fine-tuning improves a wide range of NLP tasks. GPT-2 (2019) scaled to 1.5B parameters; GPT-3 (2020) to 175B parameters, demonstrating strong in-context (few-shot) learning. GPT-3.5 and GPT-4 added instruction tuning and RLHF. Competitors include PaLM, Llama, and Claude, all following the GPT paradigm: large decoder-only models trained on diverse data.
When to use it
Use GPT-style models when:
- Text generation is the primary task
- In-context learning with few examples is sufficient
- Knowledge from massive-scale pre-training offers value
- You can afford inference latency (token-by-token generation is slower than a single forward pass)
- Few-shot or zero-shot adaptation is needed
GPT models are slower at inference than BERT for classification but excel at generation and reasoning. Instruction-tuned variants (GPT-3.5, GPT-4) handle a wide range of tasks specified in natural language.