Self-Attention

What it is

Self-attention is an attention mechanism applied within a single sequence: each position attends to every position in that same sequence, including itself. This lets the model learn rich contextual representations in which each token’s representation incorporates information from the entire sequence.

[illustrate: Sequence of tokens; attention matrix showing which tokens attend to which others; example showing “cat” token with high attention weights to “the” and “sat”]

How it works

For a sequence of n tokens with embeddings x_1, …, x_n:

  1. Project: Compute Q, K, V by linear projections of the same input X

    • Q = X · W_Q
    • K = X · W_K
    • V = X · W_V
  2. Attention: Apply scaled dot-product attention

    • scores = Q · K^T / √d_k
    • weights = softmax(scores)
    • output = weights · V
  3. Result: Each token’s output incorporates weighted information from all tokens (a runnable sketch follows below)

Key property: No recurrence; all positions are processed in parallel, enabling efficient GPU computation.
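
To make these steps concrete, here is a minimal NumPy sketch of single-head self-attention. The function names, matrix shapes, and random toy inputs are illustrative assumptions, not drawn from any particular library.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    # X: (n, d_model) token embeddings; W_*: (d_model, d_k) projection matrices.
    Q = X @ W_Q                                  # queries: one per token
    K = X @ W_K                                  # keys
    V = X @ W_V                                  # values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (n, n) scaled dot products
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ V                           # (n, d_k) contextualized outputs

# Toy usage: 4 tokens, model width 8 (all numbers are placeholders)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_Q, W_K, W_V).shape)   # (4, 8)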

Example

Sequence: ["The", "quick", "brown", "fox"]
Embeddings: [e_1, e_2, e_3, e_4]

# Self-attention for "brown" (position 3):
q_3 = e_3 · W_Q
K = [e_1, e_2, e_3, e_4] · W_K          # one key per token
scores_3 = q_3 · K^T / √d_k = [s_31, s_32, s_33, s_34]

# Typical attention pattern (illustrative numbers):
weights_3 = softmax(scores_3) ≈ [0.1, 0.3, 0.5, 0.1]   (attends most to itself and "quick")

# Output combines all tokens, weighted by attention
output_3 = 0.1×v_1 + 0.3×v_2 + 0.5×v_3 + 0.1×v_4,   where v_i = e_i · W_V
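
The same walk-through can be run end to end. The sketch below uses random toy embeddings and projections, so the printed weights are whatever those random matrices produce rather than the illustrative 0.1/0.3/0.5/0.1 pattern above.

import numpy as np

rng = np.random.default_rng(0)
tokens = ["The", "quick", "brown", "fox"]
d = 8
E = rng.normal(size=(4, d))                  # stand-in embeddings e_1..e_4
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

q_3 = E[2] @ W_Q                             # query for "brown" (index 2)
K, V = E @ W_K, E @ W_V
scores_3 = q_3 @ K.T / np.sqrt(d)
weights_3 = np.exp(scores_3 - scores_3.max())
weights_3 /= weights_3.sum()                 # softmax over the 4 tokens
output_3 = weights_3 @ V                     # contextualized vector for "brown"
print(dict(zip(tokens, weights_3.round(2))))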

Variants and history

Self-attention appeared in the Transformer paper (Vaswani et al., 2017) as the core mechanism replacing RNNs. It became foundational to BERT, GPT, and all modern transformers. Variants include causal self-attention (can attend only to past positions), sparse self-attention (attend to local windows or learned patterns), linear attention (approximate softmax with linear cost), and relative positional attention (encode relative distances).
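
As a concrete illustration of the causal variant, the sketch below masks out future positions before the softmax so each token can attend only to itself and earlier tokens; the helper name and shapes are assumptions for illustration.

import numpy as np

def causal_self_attention(X, W_Q, W_K, W_V):
    # Same as full self-attention, except each position may attend only to
    # itself and earlier positions (the masking used in GPT-style decoders).
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)   # strictly upper triangle
    scores = np.where(future, -np.inf, scores)                # block future positions
    scores = scores - scores.max(axis=-1, keepdims=True)      # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V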

When to use it

Use self-attention when:

  • You need bidirectional context (encoding, understanding)
  • Parallel computation is important
  • Capturing long-range dependencies is critical
  • You can afford O(n^2) attention computation
  • You are building transformer encoder blocks

Self-attention is now the standard choice for sequence modeling. For very long sequences, sparse or linear variants reduce the O(n^2) cost.

See also