Attention Mechanism

What it is

Attention is a mechanism for computing a weighted combination of values based on their relevance to a query. In neural networks, it lets a model focus dynamically on the most relevant parts of its input rather than processing everything with equal weight. Attention removed the fixed-length bottleneck of RNN encoder-decoder models and is the core operation of the transformer.

[illustrate: Query vector; set of key-value pairs; attention weights as heatmap showing alignment between query and keys; weighted sum of values]

How it works

Scaled Dot-Product Attention:

  1. Compute similarity: scores = Q · K^T / √d_k
  2. Normalize: weights = softmax(scores)
  3. Aggregate: output = weights · V

Where:

  • Q (query): what we’re looking for
  • K (key): what we’re searching over
  • V (value): content to aggregate
  • d_k: dimension of the keys; dividing by √d_k keeps dot products from growing with dimension, which stabilizes the softmax

Intuition: for each query, find the most similar keys, then use the resulting weights to take a weighted average of the corresponding values.
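
A minimal sketch of these three steps in NumPy; the function name and shapes are illustrative and not tied to any particular library:

  import numpy as np

  def scaled_dot_product_attention(Q, K, V):
      """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output (n_q, d_v), weights (n_q, n_k)."""
      d_k = K.shape[-1]
      # 1. Similarity between each query and each key, scaled by sqrt(d_k)
      scores = Q @ K.T / np.sqrt(d_k)
      # 2. Softmax over the keys (shift by the row max for numerical stability)
      exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
      weights = exp / exp.sum(axis=-1, keepdims=True)
      # 3. Weighted sum of the values
      return weights @ V, weights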

Example

Query: "What is the capital of France?"
Context: "France is a country. Paris is the capital."

Q = embedding("France?")
K = [embedding("France"), embedding("is"), embedding("a"), ..., embedding("Paris"), ...]
V = [value_vec("France"), value_vec("is"), ..., value_vec("Paris"), ...]

scores = Q · K^T / √d_k = [high, low, low, ..., very_high, ...]
weights = softmax(scores) = [0.05, 0.02, ..., 0.6, ...]
output = weights · V ≈ heavily weighted toward Paris representation
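
Continuing the NumPy sketch above with toy 2-dimensional vectors (the numbers are made up purely to illustrate the weighting and are not real embeddings):

  tokens = ["France", "is", "a", "country", "Paris", "capital"]
  K = np.array([[0.9, 0.1],   # France
                [0.1, 0.0],   # is
                [0.0, 0.1],   # a
                [0.5, 0.2],   # country
                [0.8, 0.9],   # Paris
                [0.4, 0.8]])  # capital
  V = K.copy()                # values usually differ from keys; reused here for brevity
  Q = np.array([[0.7, 0.8]])  # query vector roughly aligned with "Paris"/"capital"

  output, weights = scaled_dot_product_attention(Q, K, V)
  for tok, w in zip(tokens, weights[0]):
      print(f"{tok:8s} {w:.2f}")  # "Paris" receives the largest weight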

Variants and history

Attention was introduced for machine translation (Bahdanau et al., 2014) to let RNN encoder-decoder models align the decoder with relevant source words. The transformer (Vaswani et al., 2017) is built entirely from self-attention, in which every position attends to every other position in the same sequence. Variants include multi-head attention (several attention computations run in parallel over different projections), sparse attention (each position attends only within a local window), linear attention (approximations of the softmax that avoid the quadratic cost), and cross-attention (queries from one sequence attend to keys and values from another).
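
As an illustration of multi-head attention only, a sketch building on the NumPy function above; the random matrices stand in for the learned projections W_Q, W_K, W_V, and a real layer would also apply a final output projection:

  def multi_head_self_attention(X, num_heads=2, seed=0):
      """Self-attention over X of shape (n, d), split across independently projected heads."""
      rng = np.random.default_rng(seed)
      n, d = X.shape
      d_head = d // num_heads
      head_outputs = []
      for _ in range(num_heads):
          # Random stand-ins for the learned projection matrices
          W_q, W_k, W_v = (rng.standard_normal((d, d_head)) for _ in range(3))
          out, _ = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
          head_outputs.append(out)
      # Concatenate head outputs back to dimension d (omitting the final W_O projection)
      return np.concatenate(head_outputs, axis=-1)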

When to use it

Use attention in:

  • Sequence-to-sequence models (translation, summarization)
  • Language understanding tasks (classification, tagging)
  • Any model where selective focus improves performance
  • Interpretability needs (visualize attention weights)
  • Cross-modal learning (attend image regions given text query)

Attention is now standard in modern NLP architectures. The main cost is O(n^2) time and memory in the sequence length for full attention; sparse and linear variants trade some quality for better scaling.
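
As one example of a sparse variant, local-window attention masks out query-key pairs beyond a fixed distance; a sketch for self-attention with an arbitrary window size (this sketch still builds the full score matrix for clarity, whereas an efficient implementation would compute only the in-window scores):

  def local_self_attention(Q, K, V, window=2):
      """Each position attends only to positions within `window` steps of itself."""
      n, d_k = Q.shape
      scores = Q @ K.T / np.sqrt(d_k)
      # Mask out pairs farther apart than the window before the softmax
      idx = np.arange(n)
      scores = np.where(np.abs(idx[:, None] - idx[None, :]) > window, -np.inf, scores)
      exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
      weights = exp / exp.sum(axis=-1, keepdims=True)
      return weights @ V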

See also