Fine-Tuning
What it is
Fine-tuning is the process of taking a pre-trained model (trained on a large corpus with a general objective, such as language modeling) and further training it on downstream task data. Fine-tuning requires far less task-specific data than training from scratch, and typically matches or outperforms models trained on the task data alone.
[illustrate: Pre-trained model → frozen lower layers → task-specific head → gradient flow through all layers; loss on task data]
How it works
- Start with pre-trained weights: load a model trained on language modeling, masked LM, or a similar objective
- Add a task-specific head: replace or add an output layer for the target task
  - Classification: linear layer + softmax
  - QA: span selection head
  - Generation: keep the decoder; fine-tune or prompt-tune
- Training:
  - Use task-specific labeled data (usually much smaller than the pre-training data)
  - Optimize with a task loss (cross-entropy, MSE, etc.)
  - Often use lower learning rates to preserve learned features
- Hyperparameter tuning: learning rate, batch size, number of epochs (usually small)
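The steps above can be sketched end to end. The following is a minimal, hypothetical illustration (not BERT): the frozen pre-trained layers are simulated by a fixed random projection, and only a newly added linear + softmax classification head is trained with cross-entropy. All names here (`frozen_encoder`, `train_head`) are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained encoder: a fixed (frozen) projection.
# In practice this would be a loaded transformer whose weights stay frozen
# or receive a much lower learning rate than the new head.
W_enc = rng.normal(size=(8, 2))              # "pre-trained" weights, never updated

def frozen_encoder(x):
    return np.tanh(x @ W_enc.T)              # (n, 8) features, like a [CLS] vector

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_head(X, y, n_classes=2, lr=0.5, epochs=200):
    """Train only the new task-specific head; encoder weights are untouched."""
    H = frozen_encoder(X)                    # features from the frozen layers
    W = np.zeros((n_classes, H.shape[1]))    # task-specific head weights
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                 # one-hot labels
    for _ in range(epochs):
        P = softmax(H @ W.T + b)             # predicted class probabilities
        G = (P - Y) / len(y)                 # cross-entropy gradient w.r.t. logits
        W -= lr * G.T @ H                    # update head only
        b -= lr * G.sum(axis=0)
    return W, b

# Toy binary task: the label depends on the sign of the first input coordinate.
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)
W, b = train_head(X, y)
preds = softmax(frozen_encoder(X) @ W.T + b).argmax(axis=1)
print((preds == y).mean())   # the head alone fits the task on frozen features
```

The design point is the same one the bullets make: gradients only flow into the small new head, so very little labeled data is needed to adapt the model.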
Example
# Pre-trained BERT weights loaded
# Task: sentiment classification (positive/negative)
# Classification head:
#   input → BERT encoder → [CLS] token → linear(768 → 2) → logits → softmax
# Training loop:
for text, label in train_data:
    logits = model(text)
    loss = cross_entropy(logits, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# Typical fine-tuning: 2–10 epochs, learning rate 2e-5 to 1e-4
Variants and history
Transfer learning became standard with word embeddings (Word2Vec, around 2013), though embeddings were typically used as fixed features rather than fine-tuned. BERT (2018) popularized fine-tuning full models for understanding tasks. GPT models showed strong few-shot, prompt-based learning, but are also fine-tuned for specialized tasks. LoRA (2021) and adapter modules enable parameter-efficient fine-tuning by updating only a small set of added weights. Instruction tuning fine-tunes on diverse task examples to improve generalization. Prompt tuning learns only task-specific prompt embeddings, leaving the model weights frozen.
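To make the parameter-efficient idea concrete: in LoRA, the pre-trained weight matrix W stays frozen and only a low-rank update B·A (rank r much smaller than the layer width) is trained, giving an effective weight W + (α/r)·B·A. Below is a minimal numpy sketch of that structure, not the official implementation; the class name `LoRALinear` and its methods are assumptions for illustration.

```python
import numpy as np

class LoRALinear:
    """A frozen dense layer plus a trainable low-rank update (LoRA-style sketch)."""

    def __init__(self, W, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                         # frozen weights, (out, in)
        out_dim, in_dim = W.shape
        self.A = rng.normal(scale=0.01, size=(r, in_dim))  # trainable, random init
        self.B = np.zeros((out_dim, r))                    # trainable, zero init,
                                                           # so the update starts at 0
        self.scale = alpha / r

    def forward(self, x):
        # y = x W^T + (alpha/r) * x A^T B^T  — only A and B receive gradients
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size                   # only the adapters

W = np.random.default_rng(1).normal(size=(768, 768))       # pretend pre-trained layer
layer = LoRALinear(W, r=4)
x = np.ones((1, 768))
y = layer.forward(x)
print(layer.trainable_params(), W.size)   # 6144 adapter vs 589824 frozen parameters
```

Because B is initialized to zero, the layer's output at the start of fine-tuning is identical to the frozen pre-trained layer; training then moves only the ~1% of parameters in A and B.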
When to use it
Use fine-tuning when:
- Task-specific labeled data is available (100+ examples)
- Pre-trained models exist for your domain
- Training time/compute is limited
- You want to leverage large-scale pre-training
- Domain shift is moderate
Fine-tuning is an efficient default for most tasks. For very small datasets (under ~50 examples), prompt-tuning or few-shot prompting may work better. For very large labeled datasets, training from scratch may be competitive.