BERT
What it is
BERT (Devlin et al., 2018) is a bidirectional transformer pre-trained on massive corpora using masked language modeling (MLM) and next-sentence prediction objectives. Unlike earlier unidirectional language models, BERT can see context from both directions, making it powerful for understanding tasks. Pre-trained BERT can be fine-tuned for downstream tasks with minimal additional training.
[illustrate: BERT architecture with bidirectional attention; masked tokens ([MASK]) in input; pre-training objective showing prediction of masked tokens]
How it works
Architecture:
- 12 (base) or 24 (large) transformer encoder layers
- 768 (base) or 1024 (large) hidden dimensions
- Multi-head self-attention with 12 (base) or 16 (large) heads
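As a rough check on these sizes, the parameter counts of BERT-base and BERT-large can be estimated from the architecture alone. This is a back-of-the-envelope sketch (biases and LayerNorm parameters omitted), not the exact published counts:

```python
def approx_bert_params(layers, hidden, vocab=30522, max_pos=512):
    """Rough parameter count for a BERT-style encoder.

    Counts only the big weight matrices; biases and LayerNorm
    parameters are omitted for simplicity.
    """
    attn = 4 * hidden * hidden                   # Q, K, V and output projections
    ffn = 2 * hidden * (4 * hidden)              # two linear layers, 4x expansion
    embeddings = (vocab + max_pos + 2) * hidden  # token + position + segment
    return layers * (attn + ffn) + embeddings

base = approx_bert_params(12, 768)     # ~109M, close to the reported ~110M
large = approx_bert_params(24, 1024)   # ~334M vs the reported ~340M
```

The small gap to the published numbers comes from the omitted biases, LayerNorm parameters, and pooler weights.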
Pre-training objectives:
- Masked Language Modeling (MLM): Randomly select 15% of tokens for prediction; of these, 80% are replaced with [MASK], 10% with a random token, and 10% left unchanged
- Next Sentence Prediction (NSP): Predict whether sentence B follows A
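The MLM corruption scheme can be sketched in a few lines, including the paper's 80/10/10 split among the selected tokens. Here `VOCAB` and the token strings are illustrative stand-ins, not BERT's real WordPiece vocabulary:

```python
import random

MASK = "[MASK]"
VOCAB = ["cat", "mat", "dog", "sat", "the", "on"]  # toy vocabulary for illustration

def mlm_mask(tokens, mask_prob=0.15, seed=0):
    """Apply BERT-style MLM corruption.

    Of the positions selected (15% by default), 80% become [MASK],
    10% a random token, 10% are left unchanged. Returns (corrupted,
    labels), where labels is None everywhere except selected
    positions, which hold the original token the model must predict.
    """
    rng = random.Random(seed)
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok  # loss is computed only at these positions
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK
            elif r < 0.9:
                corrupted[i] = rng.choice(VOCAB)
            # else: keep the original token unchanged
    return corrupted, labels
```

Leaving 10% of selected tokens unchanged matters: it forces the model to keep a meaningful representation of every input token, since any position might need to be predicted.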
Pre-training data: English Wikipedia (2.5B words) + BookCorpus (800M words), ~3.3B words total
Fine-tuning: Add task-specific head (classifier, span selector) and train on downstream data
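For classification, the task-specific head is typically just one new linear layer over the final hidden state of the [CLS] token, trained end to end with the encoder. A minimal sketch of the head's forward pass, with shapes matching BERT-base; the random vectors here stand in for real encoder outputs and learned weights:

```python
import numpy as np

def classify_from_cls(cls_vec, W, b):
    """Task head for fine-tuning: linear layer over the [CLS] hidden
    state, then softmax. During fine-tuning, W, b, and all encoder
    weights are updated on the downstream data."""
    logits = cls_vec @ W + b
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

# Illustrative shapes for BERT-base sentiment classification (2 classes):
rng = np.random.default_rng(0)
cls_vec = rng.standard_normal(768)        # [CLS] hidden state from the encoder
W = rng.standard_normal((768, 2)) * 0.02  # new head, randomly initialized
b = np.zeros(2)
probs = classify_from_cls(cls_vec, W, b)  # class probabilities, sums to 1
```

Span tasks like extractive QA follow the same pattern, except the head predicts start and end positions over every token instead of reading only [CLS].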
Example
# Pre-training MLM:
Input: "The [MASK] sat on the [MASK]."
Target: Predict "cat" for first mask, "mat" for second
# Fine-tuning for text classification:
Input: "Great movie, highly recommend."
Head: linear(BERT([CLS] token)) → softmax(2 classes)
Output: "positive"
Variants and history
BERT appeared in 2018 and revolutionized NLP, showing that pre-training followed by fine-tuning outperforms training task-specific models from scratch. Variants include RoBERTa (improved pre-training), ALBERT (parameter sharing), DistilBERT (distilled, faster), ELECTRA (replaced-token detection), and multilingual BERT (104 languages). BERT-style pre-training became standard; later, larger models (T5, GPT-3) extended the paradigm.
When to use it
Use BERT when:
- Fine-tuning for classification, tagging, or QA
- Pre-trained contextual embeddings help your task
- You want a balanced architecture (not generation-focused)
- Inference speed is a moderate priority
- Transfer learning from massive pre-training is beneficial
BERT excels at understanding tasks, but it is slower at inference than lightweight bi-encoder setups and less suitable for generation than GPT-style models. Distilled versions (DistilBERT) offer speed improvements.