Cosine Similarity
What it is
Cosine similarity is the cosine of the angle between two vectors, typically in a high-dimensional space. It ranges from -1 (opposite directions) to 1 (same direction), with 0 indicating orthogonality. It is a standard metric for document and vector similarity in information retrieval.
How it works
For vectors A and B:
Cosine(A, B) = (A · B) / (||A|| * ||B||)
= Σ(A[i] * B[i]) / (sqrt(Σ A[i]²) * sqrt(Σ B[i]²))
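Geometric interpretation: the cosine of the angle between the vectors. The value is invariant to vector magnitude and depends only on direction. It is efficient for sparse vectors, because only components that are nonzero in both vectors contribute to the dot product. It is undefined when either vector has zero magnitude.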
[illustrate: Two vectors in 2D or 3D space with angle between them; show dot product and magnitude calculations]
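The formula above can be sketched in a few lines of Python. This is a minimal illustration for dense, equal-length vectors, not a reference implementation; the function name `cosine_similarity` and the choice to raise an error on a zero vector are assumptions made for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length dense vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Numerator: dot product A . B
    dot = sum(x * y for x, y in zip(a, b))
    # Denominator: product of the Euclidean norms ||A|| * ||B||
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption for this sketch: treat the zero vector as an error,
        # since the angle is undefined.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Same direction, different magnitude: result is (approximately) 1.
same_direction = cosine_similarity([1.0, 2.0], [3.0, 6.0])
# Orthogonal vectors: result is 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 5.0])
# Opposite directions: result is (approximately) -1.
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
```

The first call illustrates magnitude invariance: scaling one vector by 3 leaves the similarity unchanged, up to floating-point rounding.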
Example
A = {term1: 2, term2: 1, term3: 0}
B = {term1: 1, term2: 1, term3: 2}
A · B = 2*1 + 1*1 + 0*2 = 3
||A|| = sqrt(4 + 1 + 0) = sqrt(5)
||B|| = sqrt(1 + 1 + 4) = sqrt(6)
Cosine = 3 / (sqrt(5) * sqrt(6)) ≈ 0.548
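The worked example can be checked with a sparse representation, where each vector is a dict from term to weight and zero components are simply omitted. This is a sketch under assumptions: the helper name `sparse_cosine` is hypothetical, and returning 0.0 for an empty vector is a convention chosen for this example rather than part of the definition.

```python
import math

def sparse_cosine(a, b):
    """Cosine similarity for sparse vectors stored as {term: weight} dicts."""
    # Iterate over the smaller dict; terms missing from either side
    # contribute nothing to the dot product and are skipped.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention assumed for this sketch: no direction, no similarity.
        return 0.0
    return dot / (norm_a * norm_b)

# term3 has weight 0 in A, so it is left out of the dict entirely.
A = {"term1": 2, "term2": 1}
B = {"term1": 1, "term2": 1, "term3": 2}

similarity = sparse_cosine(A, B)
print(round(similarity, 3))  # 3 / sqrt(30), which rounds to 0.548
```

Only term1 and term2 appear in both vectors, so the dot product touches two entries regardless of vocabulary size.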
Variants and history
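Foundation of the vector space model in IR. Widely used for document, text, and embedding similarity. For non-negative vectors (common in IR, where components are term weights), the range narrows to [0, 1]. Extends naturally to higher dimensions and sparse representations. For large-scale similarity, vectors can be normalized to unit length so that cosine similarity reduces to a dot product, which batches efficiently as matrix multiplication on GPUs.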
When to use it
Standard for TF-IDF and vector space model. Document similarity, query-document ranking. Embedding similarity (word2vec, BERT, etc.). Efficient for sparse and dense vectors. Scale-invariant; focuses on direction not magnitude. Suitable for large-scale approximate nearest-neighbor search.
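Because the measure depends only on direction, one common way to use it for ranking is to normalize every vector to unit length once, after which each cosine similarity is a plain dot product. The sketch below shows this with an exact brute-force scan over a tiny made-up collection; it is not an approximate nearest-neighbor index, and the names `normalize`, `dot`, `docs`, and `query` are illustrative assumptions.

```python
import math

def normalize(v):
    """Scale a vector to unit length (Euclidean norm 1)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical document vectors (e.g. term weights), normalized once up front.
docs = [
    [2.0, 1.0, 0.0],
    [1.0, 1.0, 2.0],
    [4.0, 2.0, 0.0],
]
unit_docs = [normalize(d) for d in docs]

# The query points in the same direction as docs[1], at 3x the magnitude.
query = normalize([3.0, 3.0, 6.0])

# For unit vectors, the dot product equals the cosine similarity.
scores = [dot(query, d) for d in unit_docs]
best = max(range(len(scores)), key=lambda i: scores[i])
```

Here `best` is the index of the document whose direction is closest to the query. Note that `docs[0]` and `docs[2]` receive the same score, since they differ only in magnitude.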