Attention
-
Transformer
Attention-based neural architecture that dispenses with recurrence, enabling parallel training over all sequence positions and strong performance on language tasks. Introduced by Vaswani et al. (2017) in "Attention Is All You Need".
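
A minimal sketch of one encoder layer, assuming a simplified single-head setting with hypothetical NumPy parameters (no biases, no positional encoding); it shows the two sublayers, self-attention and a position-wise feed-forward network, each wrapped in a residual connection with layer normalization:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 16, 64
X = rng.standard_normal((seq_len, d_model))      # all positions processed in parallel

W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))
W1 = rng.standard_normal((d_model, d_ff))
W2 = rng.standard_normal((d_ff, d_model))

# Self-attention sublayer, then residual connection and layer norm
Q, K, V = X @ W_q, X @ W_k, X @ W_v
attn = softmax(Q @ K.T / np.sqrt(d_model)) @ V
X = layer_norm(X + attn)

# Position-wise feed-forward sublayer (ReLU MLP), then residual and norm
X = layer_norm(X + np.maximum(0, X @ W1) @ W2)

Because no step depends on the previous position's output, every row of X is processed at once; this is the parallelism that recurrent models lack.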
-
Self-Attention
Attention in which the query, key, and value vectors are all derived from the same input sequence; lets each position attend to every other position, capturing dependencies within the sequence.
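
A minimal NumPy sketch; the projection matrices W_q, W_k, W_v are hypothetical learned parameters, and the point is that all three projections come from the same sequence X:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16
X = rng.standard_normal((seq_len, d_model))      # one input sequence

# Q, K, and V are all projections of the SAME sequence X
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

weights = softmax(Q @ K.T / np.sqrt(d_model))    # (seq_len, seq_len) position-to-position weights
out = weights @ V                                # each position is a mix of all positions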
-
Multi-Head Attention
Several attention heads run in parallel on different learned projections (subspaces) of the input, and their outputs are concatenated and projected back; this lets the model learn diverse interaction patterns simultaneously.
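
A minimal NumPy sketch with hypothetical dimensions, splitting the model width into per-head subspaces, attending in each head in parallel, then concatenating:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads                      # size of each head's subspace

X = rng.standard_normal((seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) for _ in range(4))

# Project, then reshape to (n_heads, seq_len, d_head) so all heads run in parallel
def split_heads(M):
    return M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

Q, K, V = (split_heads(X @ W) for W in (W_q, W_k, W_v))

scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # per-head attention scores
heads = softmax(scores) @ V                          # (n_heads, seq_len, d_head)

# Concatenate the heads and mix them with an output projection
out = heads.transpose(1, 0, 2).reshape(seq_len, d_model) @ W_o

Each head attends over a d_head-dimensional projection rather than the full d_model, so different heads can specialize in different relationships at the same compute cost as one full-width head.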
-
Attention Mechanism
Weighted aggregation of value vectors, with weights computed from the similarity between a query and each key; lets a model focus on the most relevant parts of its input. Fundamental to Transformers and modern NLP.
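
As a minimal sketch (NumPy; all names are illustrative), the weighting variant used by the Transformer is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Weighted aggregation: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the keys
    return weights @ V                                   # weighted sum of values

# 4 queries attend over 6 key/value pairs of dimension 8
Q, K, V = (np.random.randn(n, 8) for n in (4, 6, 6))
out = scaled_dot_product_attention(Q, K, V)              # shape (4, 8)

The softmax makes each output a convex combination of the values, so positions with higher query-key similarity contribute more to the result.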