Masked Language Model
What it is
Masked Language Modeling (MLM) is a bidirectional pre-training objective in which a random subset of input tokens is hidden and the model is trained to predict them from the surrounding context. MLM is the core objective of BERT and similar bidirectional encoders. Unlike causal language modeling (predict the next token), MLM lets the model attend to context on both the left and the right of each masked position, making it well suited to understanding tasks.
[illustrate: Tokens with 15% masked; bidirectional context (left and right) informing predictions; decoder showing predicted tokens]
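One common way to write the objective (the notation here is ours, not from the section): for an input sequence $x$ with masked positions $M$, the model minimizes the cross-entropy over the masked tokens only,

$$\mathcal{L}_{\text{MLM}}(\theta) = -\sum_{i \in M} \log p_\theta\!\left(x_i \mid \tilde{x}\right),$$

where $\tilde{x}$ is the corrupted input (selected positions masked, randomized, or kept unchanged).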
How it works
- Masking strategy (see the sketch after this list):
  - Randomly select ~15% of tokens
  - Of the selected tokens, replace 80% with [MASK]
  - Replace 10% with a random token
  - Keep 10% unchanged
  - (The random and unchanged cases reduce the mismatch between pre-training and fine-tuning, where [MASK] never appears)
- Prediction:
  - Encode the masked sequence with a bidirectional Transformer encoder
  - Predict each masked token from the full left-and-right context
- Loss: Cross-entropy computed only on the masked positions
- Properties: Bidirectional (attends to context on both sides); self-supervised (no labels needed, only raw text)
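A minimal sketch of the 80/10/10 masking step and the masked-only cross-entropy loss, assuming PyTorch; `MASK_ID`, `VOCAB_SIZE`, and the generic `model` returning per-token vocabulary logits are placeholders, not part of any specific library:

```python
import torch
import torch.nn.functional as F

MASK_ID = 103        # placeholder id for the [MASK] token
VOCAB_SIZE = 30522   # placeholder vocabulary size
IGNORE_INDEX = -100  # label value excluded from the loss

def mask_tokens(input_ids, mask_prob=0.15):
    """BERT-style 80/10/10 corruption; returns (corrupted_ids, labels)."""
    labels = input_ids.clone()
    corrupted = input_ids.clone()

    # Select ~15% of positions as prediction targets
    selected = torch.rand(input_ids.shape) < mask_prob
    labels[~selected] = IGNORE_INDEX  # loss is computed only on selected positions

    # 80% of the selected positions become [MASK]
    masked = selected & (torch.rand(input_ids.shape) < 0.8)
    corrupted[masked] = MASK_ID

    # 10% become a random token (half of the remaining 20%)
    randomized = selected & ~masked & (torch.rand(input_ids.shape) < 0.5)
    corrupted[randomized] = torch.randint(VOCAB_SIZE, input_ids.shape)[randomized]

    # The final 10% stay unchanged, so the model must also re-predict visible tokens
    return corrupted, labels

def mlm_loss(model, input_ids):
    corrupted, labels = mask_tokens(input_ids)
    logits = model(corrupted)                # (batch, seq_len, vocab_size)
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),    # flatten to (batch*seq_len, vocab)
        labels.view(-1),
        ignore_index=IGNORE_INDEX,           # skip non-selected positions
    )
```

A real implementation would also exclude special tokens such as [CLS] and [SEP] when selecting positions to mask.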
Example
Original: "Paris is the capital of France"
Masked (15% of 6 tokens ≈ 1 token):
"Paris is the [MASK] of France"
Model prediction:
P(capital) = 0.95
P(city) = 0.02
P(heart) = 0.01
Loss: -log(0.95) ≈ 0.051
A more complex example:
Original: "The quick brown fox jumps"
Masked: "The [MASK] brown [MASK] jumps"
→ Predict "quick" and "fox" from context
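To reproduce predictions like the ones above in practice, one option (assumed here; the section itself does not prescribe a library) is the Hugging Face `transformers` fill-mask pipeline with a pre-trained BERT checkpoint:

```python
from transformers import pipeline

# Loads a pre-trained MLM (downloads bert-base-uncased on first use)
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for pred in unmasker("Paris is the [MASK] of France."):
    # Each prediction carries the filled-in token and its softmax score
    print(f"{pred['token_str']:>10}  {pred['score']:.3f}")
```

The output should be dominated by "capital", roughly matching the distribution sketched above; exact scores depend on the checkpoint.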
Variants and history
MLM was introduced by BERT (Devlin et al., 2018) and immediately became standard. Variants include:
- ELECTRA: Replaced token detection, where a discriminator classifies every token as original or replaced (more sample-efficient than predicting only the masked 15%)
- RoBERTa: Improved masking (dynamic masking with a new pattern each epoch), longer training on more data, and no next-sentence prediction
- ALBERT: Parameter sharing across layers and factorized embeddings
- XLNet: Permutation language modeling (an autoregressive hybrid that captures bidirectional context without [MASK] tokens)
MLM is now foundational for pre-training encoders; causal LM dominates for decoders.
When to use it
Use MLM for:
- Pre-training bidirectional encoders
- Self-supervised learning from massive unlabeled corpora
- Transfer learning to understanding tasks (classification, NER, QA)
- Domain-adaptive pre-training (see the sketch at the end of this section)
- When bidirectional context is beneficial
MLM pre-training is computationally expensive, but the cost is amortized across many downstream tasks. The trade-off: bidirectional context is powerful for understanding, but the objective is less natural for text generation than autoregressive (causal) modeling.
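As a sketch of the domain-adaptive pre-training use case listed above, continued MLM training on an in-domain corpus can be set up with Hugging Face `transformers` and `datasets`; the file name, sequence length, and training arguments below are placeholder assumptions:

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Placeholder corpus: any text dataset with a "text" column would work
corpus = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# The collator applies the 80/10/10 corruption dynamically at batch time
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-domain-adapted", num_train_epochs=1),
    train_dataset=corpus,
    data_collator=collator,
)
trainer.train()
```

Because the collator re-samples the mask for every batch, this also gives the dynamic masking behavior popularized by RoBERTa.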