BERT
What it is
BERT (Devlin et al., 2018) is a bidirectional transformer pre-trained on massive corpora using masked language modeling (MLM) and next-sentence prediction objectives. Unlike earlier unidirectional language models, BERT can see context from both directions, making it powerful for understanding tasks. Pre-trained BERT can be fine-tuned for downstream tasks with minimal additional training.
[illustrate: BERT architecture with bidirectional attention; masked tokens ([MASK]) in input; pre-training objective showing prediction of masked tokens]
How it works
Architecture:
- 12 (base) or 24 (large) transformer encoder layers
- 768 (base) or 1024 (large) hidden dimensions
- Multi-head self-attention with 12 (base) or 16 (large) heads
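As a rough check on these sizes, the parameter counts of BERT-base and BERT-large can be estimated from the architecture alone. This is a back-of-the-envelope sketch (biases and LayerNorm parameters omitted), not the exact published counts:

```python
def approx_bert_params(layers, hidden, vocab=30522, max_pos=512):
    """Rough parameter count for a BERT-style encoder.

    Counts only the big weight matrices; biases and LayerNorm
    parameters are omitted for simplicity.
    """
    attn = 4 * hidden * hidden                   # Q, K, V and output projections
    ffn = 2 * hidden * (4 * hidden)              # two linear layers, 4x expansion
    embeddings = (vocab + max_pos + 2) * hidden  # token + position + segment
    return layers * (attn + ffn) + embeddings

base = approx_bert_params(12, 768)     # ~109M, close to the reported ~110M
large = approx_bert_params(24, 1024)   # ~334M vs the reported ~340M
```

The small gap to the published numbers comes from the omitted biases, LayerNorm parameters, and pooler weights.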
Pre-training objectives:
- Masked Language Modeling (MLM): Randomly select 15% of tokens for prediction; of these, 80% are replaced with [MASK], 10% with a random token, and 10% left unchanged
- Next Sentence Prediction (NSP): Predict whether sentence B follows A
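The MLM corruption scheme can be sketched in a few lines, including the paper's 80/10/10 split among the selected tokens. Here `VOCAB` and the token strings are illustrative stand-ins, not BERT's real WordPiece vocabulary:

```python
import random

MASK = "[MASK]"
VOCAB = ["cat", "mat", "dog", "sat", "the", "on"]  # toy vocabulary for illustration

def mlm_mask(tokens, mask_prob=0.15, seed=0):
    """Apply BERT-style MLM corruption.

    Of the positions selected (15% by default), 80% become [MASK],
    10% a random token, 10% are left unchanged. Returns (corrupted,
    labels), where labels is None everywhere except selected
    positions, which hold the original token the model must predict.
    """
    rng = random.Random(seed)
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok  # loss is computed only at these positions
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK
            elif r < 0.9:
                corrupted[i] = rng.choice(VOCAB)
            # else: keep the original token unchanged
    return corrupted, labels
```

Leaving 10% of selected tokens unchanged matters: it forces the model to keep a meaningful representation of every input token, since any position might need to be predicted.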
Pre-training data: English Wikipedia (2.5B words) + BookCorpus (800M words), ~3.3B words total
Fine-tuning: Add task-specific head (classifier, span selector) and train on downstream data
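For classification, the task-specific head is typically just one new linear layer over the final hidden state of the [CLS] token, trained end to end with the encoder. A minimal sketch of the head's forward pass, with shapes matching BERT-base; the random vectors here stand in for real encoder outputs and learned weights:

```python
import numpy as np

def classify_from_cls(cls_vec, W, b):
    """Task head for fine-tuning: linear layer over the [CLS] hidden
    state, then softmax. During fine-tuning, W, b, and all encoder
    weights are updated on the downstream data."""
    logits = cls_vec @ W + b
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

# Illustrative shapes for BERT-base sentiment classification (2 classes):
rng = np.random.default_rng(0)
cls_vec = rng.standard_normal(768)        # [CLS] hidden state from the encoder
W = rng.standard_normal((768, 2)) * 0.02  # new head, randomly initialized
b = np.zeros(2)
probs = classify_from_cls(cls_vec, W, b)  # class probabilities, sums to 1
```

Span tasks like extractive QA follow the same pattern, except the head predicts start and end positions over every token instead of reading only [CLS].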
Example
# Pre-training MLM:
Input: "The [MASK] sat on the [MASK]."
Target: Predict "cat" for first mask, "mat" for second
# Fine-tuning for text classification:
Input: "Great movie, highly recommend."
Head: linear(BERT([CLS] token)) → softmax(2 classes)
Output: "positive"
Variants and history
BERT appeared in 2018 and revolutionized NLP, showing that pre-training followed by fine-tuning outperforms training task-specific models from scratch. Variants include RoBERTa (improved pre-training), ALBERT (parameter sharing), DistilBERT (distilled, faster), ELECTRA (replaced-token detection), and multilingual BERT (104 languages). BERT-style pre-training became standard; later, larger models (T5, GPT-3) extended the paradigm.
When to use it
Use BERT when:
- Fine-tuning for classification, tagging, or QA
- Pre-trained contextual embeddings help your task
- You want a balanced architecture (not generation-focused)
- Inference speed is a moderate priority
- Transfer learning from massive pre-training is beneficial
BERT excels at understanding tasks, but it is slower at inference than lightweight bi-encoder setups and less suitable for generation than GPT-style models. Distilled versions (DistilBERT) offer speed improvements.