Attention Mechanism
What it is
Attention is a mechanism for computing a weighted combination of values based on their relevance to a query. In neural networks, it lets a model dynamically focus on the most relevant parts of its input rather than processing everything with equal weight. Attention removed the fixed-length bottleneck of RNN encoder-decoders and is the core operation of transformers.
[illustrate: Query vector; set of key-value pairs; attention weights as heatmap showing alignment between query and keys; weighted sum of values]
How it works
Scaled Dot-Product Attention:
- Compute similarity: scores = Q · K^T / √d_k
- Normalize: weights = softmax(scores)
- Aggregate: output = weights · V
Where:
- Q (query): what we’re looking for
- K (key): what we’re searching over
- V (value): content to aggregate
- d_k: dimension of the keys; dividing by √d_k keeps the dot products from growing with dimension, which stabilizes the softmax
Intuition: for each query, score every key by similarity, turn the scores into weights with a softmax, and take the weighted average of the corresponding values (sketched in code below).
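A minimal NumPy sketch of these three steps; the function name and array shapes are illustrative choices, not an API from any particular library:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output (n_q, d_v), weights (n_q, n_k)."""
    d_k = K.shape[-1]
    # Similarity between each query and each key, scaled to keep magnitudes stable.
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stabilized softmax over keys turns scores into weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of the value vectors.
    return weights @ V, weights
```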
Example
Query: "What is the capital of France?"
Context: "France is a country. Paris is the capital."
Q = embedding("France?")
K = [embedding("France"), embedding("is"), embedding("a"), ..., embedding("Paris"), ...]
V = [value_vec("France"), value_vec("is"), ..., value_vec("Paris"), ...]
scores = Q · K^T / √d_k = [high, low, low, ..., very_high, ...]
weights = softmax(scores) = [0.05, 0.02, ..., 0.6, ...]
output = weights · V ≈ heavily weighted toward Paris representation
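Plugging toy numbers into the sketch above makes the flow concrete. The embeddings below are random stand-ins (not real word vectors), so only the mechanics (shapes and normalization) carry over; with trained embeddings the weight on "Paris" would dominate, as in the example.

```python
rng = np.random.default_rng(0)
tokens = ["France", "is", "a", "country", "Paris", "is", "the", "capital"]
d = 8
K = rng.normal(size=(len(tokens), d))  # one key vector per context token (random stand-ins)
V = rng.normal(size=(len(tokens), d))  # one value vector per context token
Q = rng.normal(size=(1, d))            # query vector for the question

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.round(2))  # one weight per context token; the row sums to 1
print(output.shape)      # (1, 8): a blend of the value vectors
```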
Variants and history
Attention was introduced for machine translation (Bahdanau et al., 2014) to let RNN encoder-decoder models look back at the whole source sentence instead of a single fixed vector. The Transformer (Vaswani et al., 2017) built an architecture entirely around self-attention, in which every position in a sequence attends to every other position. Variants include multi-head attention (several heads run in parallel over different projections), sparse attention (attention restricted to local windows), linear attention (approximations that avoid the full softmax), and cross-attention (queries from one sequence attend to keys and values from another).
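As an illustration of the multi-head variant, the sketch below splits the model dimension into heads and runs the function above on each slice. Real implementations also learn per-head projection matrices (W_Q, W_K, W_V) and an output projection (W_O), which are omitted here for brevity.

```python
def multi_head_attention(Q, K, V, n_heads):
    """Q: (n_q, d_model), K and V: (n_k, d_model); d_model must be divisible by n_heads."""
    d_model = Q.shape[-1]
    d_head = d_model // n_heads
    head_outputs = []
    for h in range(n_heads):
        # Each head attends over its own slice of the feature dimension.
        sl = slice(h * d_head, (h + 1) * d_head)
        out, _ = scaled_dot_product_attention(Q[:, sl], K[:, sl], V[:, sl])
        head_outputs.append(out)
    # Concatenate the per-head outputs back to the full model dimension.
    return np.concatenate(head_outputs, axis=-1)
```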
When to use it
Use attention in:
- Sequence-to-sequence models (translation, summarization)
- Language understanding tasks (classification, tagging)
- Any model where selective focus improves performance
- Interpretability needs (visualize attention weights)
- Cross-modal learning (e.g., attending to image regions given a text query)
Attention is now standard in modern NLP architectures. Cost: full attention is O(n^2) in time and memory for sequence length n; sparse and linear variants trade some quality for lower cost.
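As a rough sketch of the sparse (local-window) idea, the mask below restricts each position to a fixed neighborhood. Practical sparse implementations avoid materializing the full n × n matrix at all, which is where the savings actually come from, so this only illustrates the attention pattern:

```python
def local_attention_mask(n, window):
    """Boolean (n, n) mask: position i may attend to position j only if |i - j| <= window."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window
```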