Query Expansion

What it is

Query expansion is the process of rewriting or augmenting a search query before it is executed against an index. Instead of matching only the exact terms the user typed, the search engine adds related terms — synonyms, morphological variants, acronym expansions, or semantically similar phrases — so that relevant documents are not missed simply because they use different vocabulary.

The problem it solves is called vocabulary mismatch: a user searching for “automobile” will not retrieve a document that only contains the word “car” unless the system bridges the two terms. Query expansion is one of several techniques for doing so.

How it works

Expansion can happen at query time (dynamic) or be baked into the index (index-time synonym expansion). The steps for query-time expansion are:

  1. Parse the original query into tokens.
  2. For each token (or for the query as a whole), look up candidate expansion terms from a resource — a synonym dictionary, a thesaurus, a co-occurrence model, or an embedding nearest-neighbour lookup.
  3. Construct an expanded query that includes both the original and the expansion terms, usually as optional (OR-like) clauses weighted lower than the original terms.
  4. Execute the expanded query against the index; score and rank results normally (e.g. with BM25).

The weighting step is important: expansion terms typically receive a boost value below 1.0 so that an exact match on the original query still outranks a match on only an expanded synonym.
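
Steps 1 to 3 and the weighting rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the SYNONYMS dictionary, the function names (tokenize, expand, build_query), and the 0.5 boost are invented for the example, and the result is a plain dictionary shaped like an Elasticsearch/OpenSearch bool query rather than a call to any client library. Step 4 (execution and ranking) is left to the search engine.

```python
import re

# Hypothetical travel-domain synonym list; a real system would load a curated resource.
SYNONYMS = {
    "cheap": ["affordable", "budget", "low-cost", "discount"],
    "flights": ["airfare", "air travel", "plane tickets"],
}


def tokenize(query):
    """Step 1: lowercase the query and keep runs of letters, digits, and hyphens."""
    return re.findall(r"[a-z0-9-]+", query.lower())


def expand(tokens, synonyms):
    """Step 2: collect expansion terms for each token, in order, without duplicates."""
    expansions = []
    for token in tokens:
        for term in synonyms.get(token, []):
            if term not in expansions and term not in tokens:
                expansions.append(term)
    return expansions


def build_query(query, field="body", expansion_boost=0.5, synonyms=SYNONYMS):
    """Step 3: original query at boost 1.0, each expansion as a lower-weighted optional clause."""
    clauses = [{"match": {field: {"query": query, "boost": 1.0}}}]
    for term in expand(tokenize(query), synonyms):
        clauses.append({"match": {field: {"query": term, "boost": expansion_boost}}})
    return {"bool": {"should": clauses}}


expanded = build_query("cheap flights")
```

Here expanded holds one clause for the original query plus one clause per expansion term. Emitting a separate clause per term keeps each boost independently tunable; a multi-word term such as "air travel" is still analysed into separate tokens by a match clause, so a phrase query may be a better fit where word order matters.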

Example

Original query: “cheap flights”

After expansion against a travel-domain synonym list:

Original term    Expansion terms added
cheap            affordable, budget, low-cost, discount
flights          airfare, air travel, plane tickets

Expanded query (Elasticsearch/OpenSearch bool query with weighted should clauses):

{
  "bool": {
    "should": [
      { "match": { "body": { "query": "cheap flights", "boost": 1.0 } } },
      { "match": { "body": { "query": "affordable airfare", "boost": 0.5 } } },
      { "match": { "body": { "query": "budget air travel", "boost": 0.5 } } },
      { "match": { "body": { "query": "low-cost plane tickets", "boost": 0.5 } } }
    ]
  }
}

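A document titled “Budget Airline Ticket Deals” can now match and surface through the expansion term “budget” (and through “tickets” if the analyzer stems plurals), provided that text is indexed in the searched field, even though it shares no tokens with the original query.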

Variants and history

Query expansion has been studied since the 1960s in classical IR. The major approaches are:

Thesaurus-based expansion uses a curated synonym resource such as WordNet or a domain-specific glossary. It is reliable but limited to the terms the thesaurus covers, so it misses emerging vocabulary and jargon.

Pseudo-relevance feedback (PRF), also called blind relevance feedback, assumes the top-k results for the original query are relevant, extracts their most distinctive terms, and issues a second query that includes those terms. Rocchio’s algorithm (1971), originally formulated for explicit relevance feedback, is the classic basis for PRF. PRF is effective on average but can drift badly if the initial results are poor.
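
The term-extraction step of PRF can be sketched as below. This is a simplified heuristic (term frequency in the feedback documents times a smoothed inverse document frequency), not Rocchio's vector-space formulation; the function name prf_expansion_terms, the STOPWORDS set, and the toy corpus are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Small illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "and", "for", "in", "of", "the", "to"}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def prf_expansion_terms(query, ranked_docs, corpus, k=2, n_terms=3):
    """Return the n_terms most distinctive terms from the top-k ranked documents."""
    query_terms = set(tokenize(query))

    # Document frequency of every term across the whole corpus.
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))

    # Term frequency within the assumed-relevant feedback documents.
    feedback_tf = Counter()
    for doc in ranked_docs[:k]:
        feedback_tf.update(tokenize(doc))

    scores = {}
    for term, tf in feedback_tf.items():
        if term in query_terms or term in STOPWORDS:
            continue
        idf = math.log((len(corpus) + 1) / (doc_freq[term] + 1)) + 1.0
        scores[term] = tf * idf

    # Highest score first; ties broken alphabetically so output is deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:n_terms]]


corpus = [
    "cheap flights and budget airfare deals",
    "budget airfare for cheap flights to europe",
    "hotel deals in europe",
    "car rental deals",
]
# Assume the first two documents were the top results of the initial query.
terms = prf_expansion_terms("cheap flights", corpus[:2], corpus, k=2, n_terms=3)
```

On this toy corpus the selected terms are "airfare", "budget", and "europe". The last of these hints at the drift risk: it is distinctive in the feedback set but only loosely related to the user's intent.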

Co-occurrence and distributional models mine a corpus to find terms that appear in similar contexts. Word2Vec, GloVe, and similar embeddings make it easy to retrieve nearest neighbours as expansion candidates.
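
A nearest-neighbour lookup over embeddings can be sketched with cosine similarity. The three-dimensional vectors below are invented for illustration and do not come from any trained model; the function names and the 0.9 similarity threshold are likewise assumptions.

```python
import math

# Toy vectors, made up for this example. Real Word2Vec or GloVe embeddings
# are learned from a corpus and typically have hundreds of dimensions.
VECTORS = {
    "cheap": [0.9, 0.1, 0.0],
    "budget": [0.8, 0.2, 0.1],
    "affordable": [0.85, 0.15, 0.05],
    "flights": [0.1, 0.9, 0.2],
    "airfare": [0.2, 0.85, 0.1],
    "hotel": [0.1, 0.2, 0.9],
}


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_neighbours(term, vectors, top_n=2, min_similarity=0.9):
    """Return up to top_n other terms whose similarity to term meets the threshold."""
    if term not in vectors:
        return []
    target = vectors[term]
    scored = [
        (cosine(target, vec), other)
        for other, vec in vectors.items()
        if other != term
    ]
    kept = [(score, other) for score, other in scored if score >= min_similarity]
    kept.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in kept[:top_n]]


cheap_neighbours = nearest_neighbours("cheap", VECTORS)
flights_neighbours = nearest_neighbours("flights", VECTORS)
```

With these vectors, "cheap" expands to "affordable" and "budget", while "flights" expands only to "airfare". The similarity threshold and the cap on neighbours are the two knobs that keep the expansion list narrow.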

LLM-based expansion — a recent pattern — prompts a language model to generate alternative phrasings, hypothetical documents, or related concepts for the query. HyDE (Hypothetical Document Embeddings, Gao et al. 2022) generates a synthetic answer document and uses its embedding as the query vector, implicitly expanding into the semantic neighbourhood of the answer.
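
The HyDE flow described above can be sketched with the language model and the encoder passed in as plain functions. Everything here is a stand-in: fake_generate returns a fixed string instead of calling a model, fake_embed is a bag-of-words count over a four-word vocabulary rather than a dense encoder, and hyde_search is a hypothetical name. The sketch follows the single-document description given here and may differ in detail from the published method.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hyde_search(query, generate, embed, doc_vectors, top_n=1):
    """Embed a generated answer document and rank stored documents against it."""
    hypothetical_doc = generate(query)      # an LLM call in a real system
    query_vector = embed(hypothetical_doc)  # a dense encoder in a real system
    ranked = sorted(
        doc_vectors.items(),
        key=lambda item: -cosine(query_vector, item[1]),
    )
    return [doc_id for doc_id, _ in ranked[:top_n]]


# Stand-ins so the sketch runs without any model.
VOCAB = ["budget", "airline", "ticket", "hotel"]


def fake_generate(query):
    return "Budget airline ticket deals are often cheapest midweek."


def fake_embed(text):
    words = text.lower().split()
    return [float(words.count(word)) for word in VOCAB]


doc_vectors = {
    "flights-doc": fake_embed("budget airline ticket deals"),
    "hotel-doc": fake_embed("hotel booking guide"),
}
results = hyde_search("cheap flights", fake_generate, fake_embed, doc_vectors)
```

The raw query "cheap flights" embeds to an all-zero vector under this toy encoder, so it matches nothing on its own, whereas the generated text lands next to "flights-doc". That is the implicit expansion effect in miniature.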

When to use it

Query expansion is most valuable when:

  • Your users and your documents use different vocabulary (consumer queries vs. technical documentation, or multilingual corpora).
  • Recall matters more than precision — for example, e-commerce search where a missed product is a missed sale.
  • Your index is keyword-based (BM25) and you cannot justify the infrastructure cost of dense retrieval.

Be cautious about over-expansion. Adding too many loosely related terms degrades precision: results become noisy and users lose trust. Keep expansion lists narrow and well-validated; always weight expansions below the original query terms.

Prefer dense retrieval or hybrid search (keyword + vector) when semantic generalisation is the primary requirement and you have the infrastructure to support it. Query expansion is a lightweight complement to BM25, not a replacement for representation-based retrieval.
