GPT
What it is
GPT (Generative Pre-trained Transformer) is a series of large autoregressive language models developed by OpenAI, beginning in 2018. Where BERT uses bidirectional attention and targets understanding tasks, GPT uses causal (left-to-right) attention, which makes it naturally suited to text generation. GPT models are pre-trained at massive scale on diverse data, then fine-tuned or prompted for downstream tasks.
[illustrate: GPT decoder architecture with causal attention (can attend to past tokens only); generation process showing token-by-token sampling; scaling curve showing performance vs. model size]
How it works
Architecture:
- Decoder-only transformer
- Causal self-attention: each position can attend only to earlier positions (see the mask sketch below)
- 12 layers (GPT-1) up to 96 layers (GPT-3); GPT-4's architecture is not public
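A minimal PyTorch sketch of the causal mask inside a single attention head (the tensor names and the 5-token, 8-dimensional toy sizes are illustrative only, not taken from any GPT release):

import torch
import torch.nn.functional as F

def causal_self_attention(x, w_q, w_k, w_v):
    # x: (T, d) token representations; returns (T, d) attended representations
    T, d = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v                  # project to queries, keys, values
    scores = (q @ k.T) / d ** 0.5                         # (T, T) attention logits
    future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))   # block attention to future positions
    return F.softmax(scores, dim=-1) @ v                  # each output mixes only past/current values

x = torch.randn(5, 8)                                     # toy input: 5 tokens, 8-dim embeddings
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
out = causal_self_attention(x, w_q, w_k, w_v)             # out[t] depends only on x[: t + 1]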
Pre-training:
- Objective: Predict the next token given all previous tokens (see the loss sketch below)
- Data: Diverse web text, filtered and deduplicated
- Scaling: Larger models on more data generally perform better
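The objective in code form: shift the target sequence by one position and score each position's logits against the token that actually follows. A sketch with random stand-in logits; in practice they come from the decoder stack above.

import torch
import torch.nn.functional as F

vocab_size, T = 50_000, 6
tokens = torch.randint(0, vocab_size, (T,))         # a tokenized training sequence
logits = torch.randn(T, vocab_size)                  # stand-in for the model's per-position logits

# Position t predicts token t + 1: drop the last logit row and the first token.
loss = F.cross_entropy(logits[:-1], tokens[1:])      # mean next-token negative log-likelihood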
Inference:
- Autoregressive generation: feed the tokens produced so far, sample the next token, repeat
- Temperature and top-k sampling control output diversity (see the sampling sketch below)
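A sketch of the generation loop with temperature and top-k filtering; here `model` is assumed to be any callable mapping a 1-D tensor of token ids to per-position next-token logits (an assumption for illustration, not a specific library API):

import torch

def generate(model, prompt_ids, max_new_tokens=20, temperature=0.8, top_k=40):
    # Autoregressive sampling: append one sampled token per step.
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(torch.tensor(ids))[-1]            # logits for the next token
        logits = logits / temperature                     # < 1 sharpens, > 1 flattens the distribution
        kth_best = torch.topk(logits, top_k).values[-1]
        logits[logits < kth_best] = float("-inf")         # keep only the top-k candidates
        probs = torch.softmax(logits, dim=-1)
        ids.append(torch.multinomial(probs, 1).item())    # sample the next token
    return ids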
Adaptation: Fine-tuning or in-context learning (prompt examples)
Example
# Pre-training:
Input: "The quick brown fox"
Predict: "jumps" (and continue)
# Generation (sampling):
Prompt: "Once upon a time"
Model generates: "...there was a young girl"
Each step: sample next token from P(next | all previous)
# Few-shot learning:
"Translate English to French:
dog → chien
cat → chat
house → "
Model: "maison"
Variants and history
GPT-1 (2018) showed that generative pre-training followed by task-specific fine-tuning improves a wide range of NLP tasks. GPT-2 (2019) scaled to 1.5B parameters; GPT-3 (2020) to 175B parameters, demonstrating strong in-context (few-shot) learning. GPT-3.5 and GPT-4 added instruction tuning and RLHF. Competitors include PaLM, Llama, and Claude, all following the GPT paradigm: large decoder-only models trained on diverse data.
When to use it
Use GPT-style models when:
- Text generation is the primary task
- In-context learning with few examples is sufficient
- Knowledge from massive-scale pre-training offers value
- You can afford inference latency (token-by-token generation is slower than a single forward pass)
- Few-shot or zero-shot adaptation is needed
GPT models are slower at inference than BERT for classification but excel at generation and reasoning. Instruction-tuned variants (GPT-3.5, GPT-4) handle a wide range of tasks specified in natural language.