Vector Space Model

What it is

The vector space model (VSM) is a framework for representing text as vectors in a term-dimensional space and measuring relevance as geometric similarity. Both documents and queries become points in the same vector space; the closer a document is to the query, the more relevant it is considered.

VSM was the dominant theoretical framework for information retrieval from the 1970s through the 2000s. Its core insight, that term weights capture both document content and query intent, underpins TF-IDF and influenced later ranking models, including BM25 and dense retrieval.

How it works

Given a vocabulary of V unique terms, every document is represented as a V-dimensional vector where each dimension corresponds to a term weight:

d = [w(t₁, d), w(t₂, d), ..., w(tᵥ, d)]

A query is similarly represented:

q = [w(t₁, q), w(t₂, q), ..., w(tᵥ, q)]

The relevance of document d to query q is the cosine similarity between their vectors:

sim(d, q) = (d · q) / (||d|| · ||q||)
          = Σₜ w(t, d) · w(t, q) / (||d|| · ||q||)

With non-negative weights such as TF-IDF, cosine similarity ranges from 0 (no shared terms) to 1 (identical term distributions). Because only the angle between the vectors matters, it normalises for document length: a long document and a short document with the same relative term distribution score equally.

The typical term weight is TF-IDF: w(t, d) = tf(t, d) × idf(t).
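
The scoring above fits in a few lines of code. The following is a minimal sketch in Python (not from the original text): it stores each vector as a sparse term-to-weight dict, and the smoothed idf formula is just one common choice among several.

    import math
    from collections import Counter

    def tf_idf_vectors(docs):
        """Build sparse TF-IDF vectors (term -> weight dicts) for tokenised documents."""
        n = len(docs)
        # document frequency: number of documents containing each term
        df = Counter(term for doc in docs for term in set(doc))
        # one common smoothed idf variant; exact formulas differ between systems
        idf = {t: math.log(n / df_t) + 1.0 for t, df_t in df.items()}
        vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
        return vectors, idf

    def cosine(u, v):
        """Cosine similarity between two sparse vectors stored as dicts."""
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0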

[illustrate: 2D vector space with two term axes (e.g. “cat” and “dog”) — three document vectors and one query vector plotted as arrows from the origin — cosine similarity shown as the angle between each document vector and the query vector, with the closest document highlighted]

Example

Three documents, vocabulary = {cat, dog, sat}:

Doc   cat   dog   sat   TF-IDF vector
D1     2     0     1    [2·idf_cat, 0, 1·idf_sat]
D2     0     2     1    [0, 2·idf_dog, 1·idf_sat]
D3     1     1     0    [1·idf_cat, 1·idf_dog, 0]

Query "cat sat" → q = [1·idf_cat, 0, 1·idf_sat]

Cosine similarity ranks D1 highest (it contains both cat and sat), followed by D3 (contains cat but not sat) and then D2 (shares only sat).
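
Assuming the sketch above, the ranking can be reproduced directly from the counts in the table. Because all three terms occur in exactly two of the three documents, the idf values are equal, so the exact idf formula does not affect the ordering:

    docs = [["cat", "cat", "sat"],   # D1
            ["dog", "dog", "sat"],   # D2
            ["cat", "dog"]]          # D3
    doc_vecs, idf = tf_idf_vectors(docs)

    # weight the query terms with the corpus idf values
    q_vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter(["cat", "sat"]).items()}

    for name, vec in zip(["D1", "D2", "D3"], doc_vecs):
        print(name, round(cosine(vec, q_vec), 3))
    # prints D1 0.949, D2 0.316, D3 0.5, i.e. D1 > D3 > D2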

Variants and history

Gerard Salton introduced the vector space model at Cornell University in the 1970s as part of the SMART information retrieval system. The 1975 paper “A vector space model for automatic indexing” (Salton, Wong and Yang) formalised the framework.

VSM has several weaknesses that motivated later work:

  • Vocabulary mismatch — documents using synonyms of query terms don’t match.
  • Term independence — the model assumes all terms are orthogonal and uncorrelated.
  • High dimensionality — the term space can have millions of dimensions.

These limitations were addressed by:

  • Latent Semantic Analysis (LSA) — applies SVD to find latent semantic dimensions.
  • BM25 — retains the TF-IDF intuition but adds saturation and length normalisation.
  • Dense retrieval — replaces sparse term vectors with low-dimensional neural embeddings.

When to use it

VSM is rarely implemented directly today — BM25 supersedes it for sparse retrieval. However, the framework remains:

  • Conceptually useful — it explains why TF-IDF and cosine similarity work.
  • Practically relevant for dense retrieval — neural embedding models are, in effect, VSMs with learned low-dimensional vectors rather than sparse TF-IDF vectors.
  • Used in document clustering and classification — TF-IDF vectors are still effective features for text classification.

See also