Positional Encoding
What it is
Positional encoding (or positional embedding) augments token embeddings with information about their position in the sequence. Transformers process all tokens in parallel without inherent sequence order, so positional information must be explicitly added. This is typically done by adding sinusoidal vectors or learned embeddings to each token’s representation.
[illustrate: Token embeddings added with sinusoidal positional encodings; visualization showing periodic patterns of different frequencies]
How it works
Sinusoidal positional encoding (Vaswani et al., 2017):
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Where:
- pos: position in sequence (0, 1, 2, …)
- i: dimension index (0, 1, …, d_model/2 - 1)
- d_model: total embedding dimension
Properties:
- Different frequencies for different dimensions
- Encodes absolute position via sinusoids
- Can, in principle, extrapolate to sequence lengths unseen during training
- Deterministic and parameter-free; can be precomputed once offline (see the sketch below)
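A minimal NumPy sketch of how such an encoding table could be built (the function name and shapes here are illustrative, not from the original paper):

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(max_len)[:, np.newaxis]        # shape (max_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]       # even indices 2i, shape (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

# Precompute once; rows are added to the token embeddings of matching positions.
pe = sinusoidal_positional_encoding(max_len=3, d_model=512)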
Learned positional embeddings (common alternative):
- Each position has learnable embedding vector
- More flexible, but requires a fixed maximum sequence length
- May not generalize beyond the training length (see the sketch below)
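For contrast, a minimal PyTorch-style sketch of learned positional embeddings, assuming a fixed max_len bound (the class and variable names are hypothetical):

import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """One learnable vector per position, up to a fixed max_len."""
    def __init__(self, max_len, d_model):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)  # requires a position bound

    def forward(self, token_emb):
        # token_emb: (batch, seq_len, d_model); seq_len must not exceed max_len
        seq_len = token_emb.size(1)
        positions = torch.arange(seq_len, device=token_emb.device)
        return token_emb + self.pos_emb(positions)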
Example
# d_model = 512, sequence "the cat sat"
Token "the" at position 0:
PE(0, 0) = sin(0) = 0
PE(0, 1) = cos(0) = 1
PE(0, 2) = sin(0 / 10000^(2/512)) = sin(0) = 0
... (512 values)
Embedding = token_embedding + positional_encoding
Token "cat" at position 1:
PE(1, 0) = sin(1 / 10000^(0/512)) ≈ 0.84
PE(1, 1) = cos(1 / 10000^(0/512)) ≈ 0.54
... (different from position 0)
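The first two dimensions of this example can be checked with a few lines of standard-library Python (the values match the hand calculation above):

import math

d_model = 512
for pos in (0, 1):
    pe0 = math.sin(pos / 10000 ** (0 / d_model))   # dimension 0 (i = 0, sine)
    pe1 = math.cos(pos / 10000 ** (0 / d_model))   # dimension 1 (i = 0, cosine)
    print(pos, round(pe0, 2), round(pe1, 2))
# position 0 -> 0.0, 1.0
# position 1 -> 0.84, 0.54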
Variants and history
Sinusoidal positional encoding was introduced in the Transformer paper (Vaswani et al., 2017). Empirically it performs on par with learned embeddings, and the authors hypothesized it would extrapolate better to longer sequences. Later work introduced relative positional embeddings (Shaw et al., 2018), which encode relative distances rather than absolute positions and improve generalization. Rotary positional embeddings (RoPE; Su et al., 2021) and ALiBi (Press et al., 2021) further improved length extrapolation and efficiency.
When to use it
Use positional encoding in:
- Transformer architectures (encoder, decoder)
- Any architecture requiring sequence order awareness
- When extrapolation beyond training lengths is needed (sinusoidal)
- When you want simplicity without learnable parameters (sinusoidal)
Sinusoidal encoding is a standard, parameter-free default; learned embeddings are conceptually simpler but require a fixed maximum length. For long-context models, relative or rotary approaches (RoPE, ALiBi) are generally preferred.