Fine-Tuning

What it is

Fine-tuning is the process of taking a pre-trained model (trained on a large corpus with a general objective) and continuing training on downstream task data. It requires far less task-specific data than training from scratch and typically matches or outperforms models trained from scratch on the task data alone.

[illustrate: Pre-trained model → optionally frozen lower layers → task-specific head → gradient flow through unfrozen layers; loss on task data]

How it works

  1. Start with pre-trained weights: Load a model trained with a language-modeling, masked-LM, or similar objective

  2. Add task-specific head: Replace or add an output layer for the target task

    • Classification: linear layer + softmax
    • QA: span-selection head
    • Generation: keep the decoder and its language-modeling head
  3. Training:

    • Use task-specific labeled data (usually much smaller than the pre-training data)
    • Optimize with a task loss (cross-entropy, MSE, etc.)
    • Often use lower learning rates to preserve learned features
  4. Hyperparameter tuning: Learning rate, batch size, number of epochs (usually few)
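The "lower learning rates" in step 3 are often applied per parameter group: the pre-trained encoder is updated gently while the freshly initialized head trains faster. A minimal sketch with plain SGD (the parameter names, shapes, and rates here are illustrative assumptions, not from the text above):

```python
import numpy as np

# Illustrative parameter groups: pre-trained encoder weights get a small
# learning rate, the freshly initialized head gets a larger one.
params = {
    "encoder.weight": np.ones((4, 4)),   # pre-trained, update gently
    "head.weight": np.zeros((2, 4)),     # new, update faster
}
lrs = {"encoder.weight": 2e-5, "head.weight": 1e-3}

def sgd_step(params, grads, lrs):
    """One SGD update with a per-group learning rate."""
    for name, g in grads.items():
        params[name] -= lrs[name] * g

# Pretend every gradient entry is 1.0, then take one step.
grads = {name: np.ones_like(p) for name, p in params.items()}
sgd_step(params, grads, lrs)
# Encoder weights moved by 2e-5; head weights moved by 1e-3.
```

Setting the encoder learning rate to zero recovers the "frozen lower layers" variant, where only the head is trained.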

Example

# Pre-trained BERT weights loaded

# Task: Sentiment classification (positive/negative)
# Add classification head:
input → BERT encoder → [CLS] token → linear(768 → 2) → logits (softmax is applied inside the cross-entropy loss)

# Training:
for (text, label) in train_data:
    logits = model(text)
    loss = cross_entropy(logits, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Fine-tuning typically: 2–10 epochs, learning rate 2e-5 to 1e-4
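The pseudocode above can be made runnable. A minimal numpy sketch that stands in for the encoder with fixed feature vectors (i.e., a frozen-encoder setup) and trains only the linear(768 → 2) head with cross-entropy; the random data and the learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D, C = 768, 2                      # feature dim, number of classes

# Stand-in for [CLS] features from a frozen pre-trained encoder.
features = rng.normal(size=(32, D))
labels = rng.integers(0, C, size=32)

W = np.zeros((C, D))               # classification head: linear(768 -> 2)
b = np.zeros(C)
lr = 1e-2

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

losses = []
for epoch in range(10):
    logits = features @ W.T + b
    probs = softmax(logits)
    losses.append(-np.log(probs[np.arange(len(labels)), labels]).mean())
    # Gradient of mean cross-entropy w.r.t. logits: (probs - one_hot) / N
    grad = probs.copy()
    grad[np.arange(len(labels)), labels] -= 1.0
    grad /= len(labels)
    W -= lr * grad.T @ features    # update only the head; encoder is frozen
    b -= lr * grad.sum(axis=0)
```

The training loss starts at ln 2 (uniform predictions from the zero-initialized head) and decreases each epoch; in full fine-tuning, the same gradients would also flow into the encoder weights.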

Variants and history

Transfer from pre-trained representations became common with word embeddings (Word2Vec, around 2013), though embeddings were usually reused as frozen features rather than fine-tuned. BERT (2018) popularized fine-tuning the full model for understanding tasks. GPT models showed few-shot and prompt-based learning but are also fine-tuned for specialized tasks. LoRA (2021) and adapter modules enable parameter-efficient fine-tuning by training a small number of added parameters while the base weights stay frozen. Instruction tuning fine-tunes on diverse task examples to improve generalization. Prompt tuning learns only task-specific prompt embeddings, not model weights.
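LoRA's parameter-efficient update can be sketched directly: the pre-trained weight W stays frozen and only a low-rank correction B @ A is trained, so the effective weight is W + (alpha / r) · B @ A. A minimal numpy sketch (the zero init of B and the alpha/r scaling follow the LoRA paper; the dimensions here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 16, 2, 4

W = rng.normal(size=(d_out, d_in))       # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable, small random init
B = np.zeros((d_out, r))                 # trainable, zero init

def lora_forward(x, W, A, B, alpha, r):
    """y = x W^T + (alpha/r) x (BA)^T; only A and B receive gradients."""
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(4, d_in))
y0 = lora_forward(x, W, A, B, alpha, r)
# With B zero-initialized, the adapted model starts out exactly equal
# to the frozen pre-trained model.
assert np.allclose(y0, x @ W.T)
```

The savings come from the rank: here the trainable A and B hold r·(d_in + d_out) = 48 values versus d_in·d_out = 128 in the frozen W, and the gap grows sharply at transformer scale.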

When to use it

Use fine-tuning when:

  • Task-specific labeled data is available (100+ examples)
  • Pre-trained models exist for your domain
  • Training time/compute is limited
  • You want to leverage large-scale pre-training
  • Domain shift is moderate

Fine-tuning is efficient for most tasks. For very small datasets (<50 examples), prompt-tuning or few-shot learning may be better. For very large labeled datasets, full training from scratch may be competitive.

See also