Self-Attention
What it is
Self-attention is an attention mechanism applied within a single sequence: each position attends to every position in the sequence, including itself. This lets the model learn rich contextual representations in which each token's representation incorporates information from the entire sequence.
[illustrate: Sequence of tokens; attention matrix showing which tokens attend to which others; example showing “cat” token with high attention weights to “the” and “sat”]
How it works
For a sequence of n tokens with embeddings x_1, …, x_n:
- Project: Compute Q, K, V by linear projections of the same input
- Q = X · W_Q
- K = X · W_K
- V = X · W_V
- Attention: Apply scaled dot-product attention
- scores = Q · K^T / √d_k
- weights = softmax(scores)
- output = weights · V
- Result: Each token’s output incorporates weighted information from all tokens
Key property: No recurrence; all positions processed in parallel, enabling efficient GPU computation.
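The three steps above can be sketched in NumPy; the weight matrices here are random placeholders standing in for trained parameters:

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention over a sequence X of shape (n, d_model)."""
    Q = X @ W_Q                                 # (n, d_k)
    K = X @ W_K                                 # (n, d_k)
    V = X @ W_V                                 # (n, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # (n, n) pairwise similarities
    # Row-wise softmax (shifted by the row max for numerical stability)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                          # (n, d_v): weighted mix of all tokens

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8
X = rng.normal(size=(n, d_model))               # four token embeddings
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, W_Q, W_K, W_V)
print(out.shape)  # (4, 8)
```

Note that nothing in the loop-free computation depends on sequence order; every pairwise score is computed in one matrix product, which is what makes the parallelism possible.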
Example
Sequence: ["The", "quick", "brown", "fox"]
Embeddings: [e_1, e_2, e_3, e_4]
# Self-attention for "brown" (position 3):
Q_3 = e_3 · W_Q
K = [e_1, e_2, e_3, e_4] · W_K
scores = Q_3 · K^T / √d_k = [q3k1, q3k2, q3k3, q3k4]
# After softmax, a typical attention pattern:
weights = softmax(scores) ≈ [0.1, 0.3, 0.5, 0.1] (attends most to self and "quick")
# Output combines all tokens weighted by attention
output_3 = 0.1×V_1 + 0.3×V_2 + 0.5×V_3 + 0.1×V_4
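The final weighted sum can be checked with toy numbers; the value vectors and weights below are made up purely for illustration:

```python
import numpy as np

# Hypothetical 2-d value vectors for ["The", "quick", "brown", "fox"]
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.5, 0.5]])
weights_3 = np.array([0.1, 0.3, 0.5, 0.1])  # attention weights from "brown"

# Same as 0.1*V_1 + 0.3*V_2 + 0.5*V_3 + 0.1*V_4
output_3 = weights_3 @ V
print(output_3)  # [0.65 0.85]
```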
Variants and history
Self-attention appeared in the Transformer paper (Vaswani et al., 2017) as the core mechanism replacing RNNs. It became foundational to BERT, GPT, and all modern transformers. Variants include causal self-attention (can attend only to past positions), sparse self-attention (attend to local windows or learned patterns), linear attention (approximate softmax with linear cost), and relative positional attention (encode relative distances).
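Of the variants above, causal self-attention is the simplest to show: future positions are masked out before the softmax so each token attends only to itself and earlier tokens (a minimal NumPy sketch, not a library API):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Self-attention where each position sees only itself and earlier positions."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # True above the diagonal marks "future" positions
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)    # -inf becomes 0 under softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, w = causal_attention(Q, K, V)
print(np.triu(w, k=1).max())  # 0.0: no weight on future tokens
```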
When to use it
Use self-attention when:
- You need bidirectional context (encoding, understanding)
- Parallel computation is important
- Capturing long-range dependencies is critical
- You can afford O(n^2) attention computation
- Building encoder blocks in transformers
Self-attention is now standard. For very long sequences, sparse or linear variants reduce complexity.