Transformer

What it is

The Transformer (Vaswani et al., 2017) is a neural architecture based entirely on self-attention mechanisms, without recurrence or convolution. It processes all tokens in parallel, enabling efficient training on large corpora and strong performance on sequence-to-sequence tasks, language modeling, and classification.

[illustrate: Transformer encoder-decoder architecture with multi-head attention blocks; input → embeddings → positional encoding → attention layers → output]

How it works

  1. Input embedding: Tokenized input is embedded and augmented with positional information
  2. Self-attention: For each position, compute a weighted combination of the value vectors at all positions, with weights given by the similarity between that position’s query and every position’s key
  3. Multi-head attention: Parallel attention heads capture different interaction patterns
  4. Feed-forward: Position-wise MLP after attention
  5. Layering: Stack multiple transformer blocks
  6. Output: Sequence of contextualized representations
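Steps 2 and 4 above can be sketched in a few lines of NumPy. This is a minimal single-head illustration with toy dimensions and random weights, not the full multi-head, multi-layer architecture:

```python
# Single-head scaled dot-product attention followed by a position-wise MLP.
# Toy sizes and random weights for illustration only.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 3, 8                          # 3 tokens, 8-dim embeddings

x = rng.normal(size=(seq_len, d_model))          # embedded input (step 1)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

Q, K, V = x @ Wq, x @ Wk, x @ Wv                 # queries, keys, values

# Self-attention (step 2): softmax(Q K^T / sqrt(d)) V
scores = Q @ K.T / np.sqrt(d_model)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1
attended = weights @ V                           # weighted combination of values

# Position-wise feed-forward (step 4): same 2-layer MLP at every position
W1 = rng.normal(size=(d_model, 4 * d_model))
W2 = rng.normal(size=(4 * d_model, d_model))
out = np.maximum(attended @ W1, 0) @ W2          # ReLU MLP

print(out.shape)                                 # (3, 8): one vector per token
```

Multi-head attention (step 3) runs several such attention computations in parallel on lower-dimensional projections and concatenates the results.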

Encoder-decoder variant:

  • Encoder: bidirectional self-attention on input
  • Decoder: causal self-attention on the output (each position can attend only to itself and earlier positions), plus cross-attention to the encoder’s representations

Example

# Input: "the cat sat"
# Embedding + positional encoding: 3 tokens → 512-dim vectors

# Multi-head attention (8 heads):
# Each head learns different aspects
# Head 1 might learn: "cat" attends to "the" (article)
# Head 2 might learn: "sat" attends to "cat" (subject)
# Head 3 might learn: all positions attend broadly to each other (global context)

# Output: 3 contextualized vectors capturing interactions
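The "embedding + positional encoding" step in this example can be made concrete with the sinusoidal encoding from the original paper, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)), added to the token embeddings so the model can use word order:

```python
# Sinusoidal positional encoding (Vaswani et al., 2017).
# Assumes an even d_model so sine/cosine columns interleave cleanly.
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]        # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dims: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dims: cosine
    return pe

pe = positional_encoding(3, 512)                 # "the cat sat": 3 positions
print(pe.shape)                                  # (3, 512)
print(pe[0, :4])                                 # position 0: [0. 1. 0. 1.]
```

Each position gets a unique, deterministic pattern, and nearby positions get similar encodings, which lets attention heads learn relative-position relationships.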

Variants and history

Transformers appeared in 2017 and quickly became dominant, replacing RNNs and CNNs in NLP. BERT (Devlin et al., 2018) applied bidirectional transformers to language understanding; the GPT series applied unidirectional transformers to language generation. Variants include Vision Transformers (images), efficient transformers (sparse attention, linear complexity), and Mixture of Experts (conditional computation). Transformers now underpin virtually all large-scale NLP models.

When to use it

Use transformers when:

  • Building state-of-the-art NLP models
  • Language understanding or generation is needed
  • Parallel training is important
  • You can afford moderate computational cost
  • Transfer learning from pre-trained models is available

Transformers excel across a wide range of NLP tasks but require substantial compute to train. For inference on edge devices, distilled or quantized versions are available.

See also