Masked Language Model
What it is
Masked Language Modeling (MLM) is a bidirectional pre-training objective in which a random subset of input tokens is hidden and the model is trained to predict them from the surrounding context. MLM is the core objective of BERT and similar bidirectional encoders. Unlike causal language modeling (predict the next token), MLM lets the model attend to context on both the left and the right of each masked position, making it well suited to understanding tasks.
[illustrate: Tokens with 15% masked; bidirectional context (left and right) informing predictions; decoder showing predicted tokens]
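One common way to write the objective (the notation here is ours, not from the section): for an input sequence $x$ with masked positions $M$, the model minimizes the cross-entropy over the masked tokens only,

$$\mathcal{L}_{\text{MLM}}(\theta) = -\sum_{i \in M} \log p_\theta\!\left(x_i \mid \tilde{x}\right),$$

where $\tilde{x}$ is the corrupted input (selected positions masked, randomized, or kept unchanged).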
How it works
- Masking strategy (see the sketch after this list):
  - Randomly select ~15% of tokens
  - Of the selected tokens, replace 80% with [MASK]
  - Replace 10% with a random token
  - Keep 10% unchanged
  - (The random and unchanged cases reduce the mismatch between pre-training and fine-tuning, where [MASK] never appears)
- Prediction:
  - Encode the masked sequence with a bidirectional Transformer encoder
  - Predict each masked token from the full left-and-right context
- Loss: Cross-entropy computed only on the masked positions
- Properties: Bidirectional (attends to context on both sides); self-supervised (no labels needed, only raw text)
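A minimal sketch of the 80/10/10 masking step and the masked-only cross-entropy loss, assuming PyTorch; `MASK_ID`, `VOCAB_SIZE`, and the generic `model` returning per-token vocabulary logits are placeholders, not part of any specific library:

```python
import torch
import torch.nn.functional as F

MASK_ID = 103        # placeholder id for the [MASK] token
VOCAB_SIZE = 30522   # placeholder vocabulary size
IGNORE_INDEX = -100  # label value excluded from the loss

def mask_tokens(input_ids, mask_prob=0.15):
    """BERT-style 80/10/10 corruption; returns (corrupted_ids, labels)."""
    labels = input_ids.clone()
    corrupted = input_ids.clone()

    # Select ~15% of positions as prediction targets
    selected = torch.rand(input_ids.shape) < mask_prob
    labels[~selected] = IGNORE_INDEX  # loss is computed only on selected positions

    # 80% of the selected positions become [MASK]
    masked = selected & (torch.rand(input_ids.shape) < 0.8)
    corrupted[masked] = MASK_ID

    # 10% become a random token (half of the remaining 20%)
    randomized = selected & ~masked & (torch.rand(input_ids.shape) < 0.5)
    corrupted[randomized] = torch.randint(VOCAB_SIZE, input_ids.shape)[randomized]

    # The final 10% stay unchanged, so the model must also re-predict visible tokens
    return corrupted, labels

def mlm_loss(model, input_ids):
    corrupted, labels = mask_tokens(input_ids)
    logits = model(corrupted)                # (batch, seq_len, vocab_size)
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),    # flatten to (batch*seq_len, vocab)
        labels.view(-1),
        ignore_index=IGNORE_INDEX,           # skip non-selected positions
    )
```

A real implementation would also exclude special tokens such as [CLS] and [SEP] when selecting positions to mask.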
Example
Original: "Paris is the capital of France"
Masked (15% of 6 tokens ≈ 1 token):
"Paris is the [MASK] of France"
Model prediction:
P(capital) = 0.95
P(city) = 0.02
P(heart) = 0.01
Loss: -log(0.95) ≈ 0.051
A more complex example:
Original: "The quick brown fox jumps"
Masked: "The [MASK] brown [MASK] jumps"
→ Predict "quick" and "fox" from context
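To reproduce predictions like the ones above in practice, one option (assumed here; the section itself does not prescribe a library) is the Hugging Face `transformers` fill-mask pipeline with a pre-trained BERT checkpoint:

```python
from transformers import pipeline

# Loads a pre-trained MLM (downloads bert-base-uncased on first use)
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for pred in unmasker("Paris is the [MASK] of France."):
    # Each prediction carries the filled-in token and its softmax score
    print(f"{pred['token_str']:>10}  {pred['score']:.3f}")
```

The output should be dominated by "capital", roughly matching the distribution sketched above; exact scores depend on the checkpoint.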
Variants and history
MLM was introduced by BERT (Devlin et al., 2018) and immediately became standard. Variants include:
- ELECTRA: Replaced token detection, where a discriminator classifies every token as original or replaced (more sample-efficient than predicting only the masked 15%)
- RoBERTa: Improved masking (dynamic masking with a new pattern each epoch), longer training on more data, and no next-sentence prediction
- ALBERT: Parameter sharing across layers and factorized embeddings
- XLNet: Permutation language modeling (an autoregressive hybrid that captures bidirectional context without [MASK] tokens)
MLM is now foundational for pre-training encoders; causal LM dominates for decoders.
When to use it
Use MLM for:
- Pre-training bidirectional encoders
- Self-supervised learning from massive unlabeled corpora
- Transfer learning to understanding tasks (classification, NER, QA)
- Domain-adaptive pre-training (see the sketch at the end of this section)
- When bidirectional context is beneficial
MLM pre-training is computationally expensive, but the cost is amortized across many downstream tasks. The trade-off: bidirectional context is powerful for understanding, but the objective is less natural for text generation than autoregressive (causal) modeling.
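As a sketch of the domain-adaptive pre-training use case listed above, continued MLM training on an in-domain corpus can be set up with Hugging Face `transformers` and `datasets`; the file name, sequence length, and training arguments below are placeholder assumptions:

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Placeholder corpus: any text dataset with a "text" column would work
corpus = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# The collator applies the 80/10/10 corruption dynamically at batch time
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-domain-adapted", num_train_epochs=1),
    train_dataset=corpus,
    data_collator=collator,
)
trainer.train()
```

Because the collator re-samples the mask for every batch, this also gives the dynamic masking behavior popularized by RoBERTa.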