BERT

What it is

BERT (Devlin et al., 2018) is a bidirectional transformer encoder pre-trained on large corpora with two objectives: masked language modeling (MLM) and next sentence prediction (NSP). Unlike earlier unidirectional language models, BERT conditions on context from both the left and the right of each token, which makes it well suited to natural language understanding tasks. A pre-trained BERT can then be fine-tuned for downstream tasks with relatively little additional training.

[illustrate: BERT architecture with bidirectional attention; masked tokens ([MASK]) in input; pre-training objective showing prediction of masked tokens]
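As a quick orientation, the snippet below loads a pre-trained BERT and extracts contextual embeddings for a sentence. It is a minimal sketch assuming the Hugging Face transformers and torch packages (not part of the original BERT release); bert-base-uncased is one of the standard published checkpoints.

# Minimal sketch: contextual embeddings from a pre-trained BERT
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional contextual vector per token: shape (batch, seq_len, 768)
print(outputs.last_hidden_state.shape)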

How it works

  1. Architecture:

    • 12 (BERT-base) or 24 (BERT-large) transformer encoder layers
    • 768 or 1024 hidden dimensions
    • Multi-head self-attention (12 or 16 heads)
  2. Pre-training objectives:

    • Masked Language Modeling (MLM): Randomly mask 15% of tokens; predict masked tokens from context (see the masking sketch after this list)
    • Next Sentence Prediction (NSP): Predict whether sentence B follows A
  3. Pre-training data: English Wikipedia + BookCorpus (~3.3B words)

  4. Fine-tuning: Add task-specific head (classifier, span selector) and train on downstream data
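A minimal sketch of the MLM masking step from item 2, in plain Python. Note that the original paper replaces only 80% of the selected tokens with [MASK] (10% with a random token, 10% left unchanged); this illustration uses the simpler all-[MASK] variant.

import random

def mask_tokens(tokens, mask_prob=0.15):
    # Randomly select ~15% of tokens to mask; the model must predict the originals.
    masked, targets = [], []
    for tok in tokens:
        if tok not in ("[CLS]", "[SEP]") and random.random() < mask_prob:
            masked.append("[MASK]")
            targets.append(tok)      # loss is computed only at masked positions
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets

tokens = "[CLS] the cat sat on the mat . [SEP]".split()
print(mask_tokens(tokens))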

Example

# Pre-training MLM:
Input: "The [MASK] sat on the [MASK]."
Target: Predict "cat" for first mask, "mat" for second
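The same kind of prediction can be run with a pre-trained checkpoint. A minimal sketch using the Hugging Face fill-mask pipeline (an assumption; any MLM-capable toolkit works), shown here with a single masked position:

from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
# Top candidate tokens and their scores for the masked position
for pred in fill("The cat sat on the [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))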

# Fine-tuning for text classification:
Input: "Great movie, highly recommend."
Head: linear(BERT([CLS] token)) → softmax(2 classes)
Output: "positive"
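A minimal sketch of this classification setup, assuming PyTorch and the Hugging Face transformers library; a real fine-tuning run would add labels, an optimizer, and a training loop.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(bert.config.hidden_size, 2)   # 2 classes: negative / positive

inputs = tokenizer("Great movie, highly recommend.", return_tensors="pt")
cls_vec = bert(**inputs).last_hidden_state[:, 0]           # representation of the [CLS] token
probs = torch.softmax(classifier(cls_vec), dim=-1)         # P(negative), P(positive)
# During fine-tuning, both bert and classifier weights are updated on labeled examples.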

Variants and history

BERT appeared in 2018 and revolutionized NLP by showing that pre-training plus fine-tuning outperforms training task-specific models from scratch. Variants include RoBERTa (improved pre-training recipe), ALBERT (parameter sharing), DistilBERT (distilled, faster), ELECTRA (replaced-token detection), and multilingual BERT (104 languages). BERT-style pre-training became standard practice, and larger models (GPT-3, T5) extended the paradigm.

When to use it

Use BERT when:

  • Fine-tuning for classification, tagging, or QA
  • Pre-trained contextual embeddings help your task
  • You need an encoder-only model aimed at understanding rather than generation
  • Inference speed matters but is not the top priority
  • Transfer learning from massive pre-training is beneficial

BERT excels at understanding tasks, but scoring sentence pairs with a full cross-encoder pass is slower at inference than bi-encoder setups that pre-compute embeddings, and it is less suitable for generation than GPT-style decoder models. Distilled versions (DistilBERT) offer substantial speed improvements (see the sketch below).
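Under the same transformers-based setup sketched earlier, switching to a distilled model is just a matter of swapping the checkpoint name (distilbert-base-uncased is the standard published checkpoint):

from transformers import AutoModel, AutoTokenizer

# Drop-in replacement when latency matters more than the last point of accuracy
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")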

See also