Transformer

What it is

The Transformer (Vaswani et al., 2017) is a neural architecture based entirely on self-attention mechanisms, without recurrence or convolution. It processes all tokens in parallel, enabling efficient training on large corpora and strong performance on sequence-to-sequence tasks, language modeling, and classification.

[illustrate: Transformer encoder-decoder architecture with multi-head attention blocks; input → embeddings → positional encoding → attention layers → output]

How it works

  1. Input embedding: Tokenized input is embedded and augmented with positional information
  2. Self-attention: For each position, compute a weighted combination of the value vectors at all positions, with weights given by the similarity between that position’s query and every position’s key
  3. Multi-head attention: Parallel attention heads capture different interaction patterns
  4. Feed-forward: Position-wise MLP after attention
  5. Layering: Stack multiple transformer blocks
  6. Output: Sequence of contextualized representations
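Steps 2 and 4 above can be sketched in a few lines of NumPy. This is a minimal single-head illustration with toy dimensions and random weights, not the full multi-head, multi-layer architecture:

```python
# Single-head scaled dot-product attention followed by a position-wise MLP.
# Toy sizes and random weights for illustration only.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 3, 8                          # 3 tokens, 8-dim embeddings

x = rng.normal(size=(seq_len, d_model))          # embedded input (step 1)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

Q, K, V = x @ Wq, x @ Wk, x @ Wv                 # queries, keys, values

# Self-attention (step 2): softmax(Q K^T / sqrt(d)) V
scores = Q @ K.T / np.sqrt(d_model)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1
attended = weights @ V                           # weighted combination of values

# Position-wise feed-forward (step 4): same 2-layer MLP at every position
W1 = rng.normal(size=(d_model, 4 * d_model))
W2 = rng.normal(size=(4 * d_model, d_model))
out = np.maximum(attended @ W1, 0) @ W2          # ReLU MLP

print(out.shape)                                 # (3, 8): one vector per token
```

Multi-head attention (step 3) runs several such attention computations in parallel on lower-dimensional projections and concatenates the results.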

Encoder-decoder variant:

  • Encoder: bidirectional self-attention on input
  • Decoder: causal self-attention on the output (each position can attend only to itself and earlier positions), plus cross-attention to the encoder’s representations

Example

# Input: "the cat sat"
# Embedding + positional encoding: 3 tokens → 512-dim vectors

# Multi-head attention (8 heads):
# Each head learns different aspects
# Head 1 might learn: "cat" attends to "the" (article)
# Head 2 might learn: "sat" attends to "cat" (subject)
# Head 3 might learn: all positions attend broadly to each other (global context)

# Output: 3 contextualized vectors capturing interactions
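The "embedding + positional encoding" step in this example can be made concrete with the sinusoidal encoding from the original paper, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)), added to the token embeddings so the model can use word order:

```python
# Sinusoidal positional encoding (Vaswani et al., 2017).
# Assumes an even d_model so sine/cosine columns interleave cleanly.
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]        # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dims: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dims: cosine
    return pe

pe = positional_encoding(3, 512)                 # "the cat sat": 3 positions
print(pe.shape)                                  # (3, 512)
print(pe[0, :4])                                 # position 0: [0. 1. 0. 1.]
```

Each position gets a unique, deterministic pattern, and nearby positions get similar encodings, which lets attention heads learn relative-position relationships.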

Variants and history

Transformers appeared in 2017 and quickly became dominant, replacing RNNs and CNNs in NLP. BERT (Devlin et al., 2018) applied bidirectional transformers to language understanding; the GPT series applied unidirectional transformers to language generation. Variants include Vision Transformers (images), efficient transformers (sparse attention, linear complexity), and Mixture of Experts (conditional computation). Transformers now underpin virtually all large-scale NLP models.

When to use it

Use transformers when:

  • Building state-of-the-art NLP models
  • Language understanding or generation is needed
  • Parallel training is important
  • You can afford moderate computational cost
  • Transfer learning from pre-trained models is available

Transformers excel across a wide range of NLP tasks but require substantial compute to train. For inference on edge devices, distilled or quantized versions are available.

See also