Fine-Tuning

What it is

Fine-tuning is the process of taking a pre-trained model (trained on a large corpus with a general objective) and continuing training on downstream task data. It requires far less task-specific data than training from scratch and typically matches or outperforms models trained from scratch on the task data alone.

[illustrate: Pre-trained model → optionally frozen lower layers → task-specific head → gradient flow through unfrozen layers; loss on task data]

How it works

  1. Start with pre-trained weights: Load a model trained with a language-modeling, masked-LM, or similar objective

  2. Add task-specific head: Replace or add an output layer for the target task

    • Classification: linear layer + softmax
    • QA: span-selection head
    • Generation: keep the decoder and its language-modeling head
  3. Training:

    • Use task-specific labeled data (usually much smaller than the pre-training data)
    • Optimize with a task loss (cross-entropy, MSE, etc.)
    • Often use lower learning rates to preserve learned features
  4. Hyperparameter tuning: Learning rate, batch size, number of epochs (usually few)
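The "lower learning rates" in step 3 are often applied per parameter group: the pre-trained encoder is updated gently while the freshly initialized head trains faster. A minimal sketch with plain SGD (the parameter names, shapes, and rates here are illustrative assumptions, not from the text above):

```python
import numpy as np

# Illustrative parameter groups: pre-trained encoder weights get a small
# learning rate, the freshly initialized head gets a larger one.
params = {
    "encoder.weight": np.ones((4, 4)),   # pre-trained, update gently
    "head.weight": np.zeros((2, 4)),     # new, update faster
}
lrs = {"encoder.weight": 2e-5, "head.weight": 1e-3}

def sgd_step(params, grads, lrs):
    """One SGD update with a per-group learning rate."""
    for name, g in grads.items():
        params[name] -= lrs[name] * g

# Pretend every gradient entry is 1.0, then take one step.
grads = {name: np.ones_like(p) for name, p in params.items()}
sgd_step(params, grads, lrs)
# Encoder weights moved by 2e-5; head weights moved by 1e-3.
```

Setting the encoder learning rate to zero recovers the "frozen lower layers" variant, where only the head is trained.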

Example

# Pre-trained BERT weights loaded

# Task: Sentiment classification (positive/negative)
# Add classification head:
input → BERT encoder → [CLS] token → linear(768 → 2) → logits (softmax is applied inside the cross-entropy loss)

# Training:
for (text, label) in train_data:
    logits = model(text)
    loss = cross_entropy(logits, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Fine-tuning typically: 2–10 epochs, learning rate 2e-5 to 1e-4
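The pseudocode above can be made runnable. A minimal numpy sketch that stands in for the encoder with fixed feature vectors (i.e., a frozen-encoder setup) and trains only the linear(768 → 2) head with cross-entropy; the random data and the learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D, C = 768, 2                      # feature dim, number of classes

# Stand-in for [CLS] features from a frozen pre-trained encoder.
features = rng.normal(size=(32, D))
labels = rng.integers(0, C, size=32)

W = np.zeros((C, D))               # classification head: linear(768 -> 2)
b = np.zeros(C)
lr = 1e-2

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

losses = []
for epoch in range(10):
    logits = features @ W.T + b
    probs = softmax(logits)
    losses.append(-np.log(probs[np.arange(len(labels)), labels]).mean())
    # Gradient of mean cross-entropy w.r.t. logits: (probs - one_hot) / N
    grad = probs.copy()
    grad[np.arange(len(labels)), labels] -= 1.0
    grad /= len(labels)
    W -= lr * grad.T @ features    # update only the head; encoder is frozen
    b -= lr * grad.sum(axis=0)
```

The training loss starts at ln 2 (uniform predictions from the zero-initialized head) and decreases each epoch; in full fine-tuning, the same gradients would also flow into the encoder weights.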

Variants and history

Transfer from pre-trained representations became common with word embeddings (Word2Vec, around 2013), though embeddings were usually reused as frozen features rather than fine-tuned. BERT (2018) popularized fine-tuning the full model for understanding tasks. GPT models showed few-shot and prompt-based learning but are also fine-tuned for specialized tasks. LoRA (2021) and adapter modules enable parameter-efficient fine-tuning by training a small number of added parameters while the base weights stay frozen. Instruction tuning fine-tunes on diverse task examples to improve generalization. Prompt tuning learns only task-specific prompt embeddings, not model weights.
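LoRA's parameter-efficient update can be sketched directly: the pre-trained weight W stays frozen and only a low-rank correction B @ A is trained, so the effective weight is W + (alpha / r) · B @ A. A minimal numpy sketch (the zero init of B and the alpha/r scaling follow the LoRA paper; the dimensions here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 16, 2, 4

W = rng.normal(size=(d_out, d_in))       # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable, small random init
B = np.zeros((d_out, r))                 # trainable, zero init

def lora_forward(x, W, A, B, alpha, r):
    """y = x W^T + (alpha/r) x (BA)^T; only A and B receive gradients."""
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(4, d_in))
y0 = lora_forward(x, W, A, B, alpha, r)
# With B zero-initialized, the adapted model starts out exactly equal
# to the frozen pre-trained model.
assert np.allclose(y0, x @ W.T)
```

The savings come from the rank: here the trainable A and B hold r·(d_in + d_out) = 48 values versus d_in·d_out = 128 in the frozen W, and the gap grows sharply at transformer scale.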

When to use it

Use fine-tuning when:

  • Task-specific labeled data is available (100+ examples)
  • Pre-trained models exist for your domain
  • Training time/compute is limited
  • You want to leverage large-scale pre-training
  • Domain shift is moderate

Fine-tuning is efficient for most tasks. For very small datasets (<50 examples), prompt-tuning or few-shot learning may be better. For very large labeled datasets, full training from scratch may be competitive.

See also