More Like This

What it is

More-like-this (MLT) automatically constructs a query from a seed document’s prominent terms and uses it to retrieve similar documents. It extracts representative terms, weights them, and searches for documents containing those terms.

How it works

A typical MLT algorithm proceeds as follows:

  1. Extract tokens from the seed document
  2. Weight terms by TF-IDF or similar relevance metric
  3. Select top-k terms by weight (filtering stop words and short terms)
  4. Construct a query (e.g., boolean OR with term boosts)
  5. Execute query and rank results
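
The five steps above can be sketched in a few lines of Python. This is a minimal illustration rather than how any particular search engine implements MLT: the stop-word list, the smoothed IDF formula, the rule that skips terms absent from the corpus, and the function names (tokenize, top_terms, more_like_this) are all assumptions made for the example. Scoring is simplified to the sum of the weights of the matching terms, which stands in for a boolean OR query with term boosts.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real systems use larger ones).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "is", "with"}

def tokenize(text, min_len=3):
    """Step 1: lowercase, split on non-letters, drop stop words and short tokens."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if len(t) >= min_len and t not in STOP_WORDS]

def top_terms(seed, corpus, k=3):
    """Steps 2-3: weight seed terms by TF-IDF and keep the k heaviest."""
    docs = [set(tokenize(d)) for d in corpus]
    tf = Counter(tokenize(seed))
    weights = {}
    for term, freq in tf.items():
        df = sum(term in d for d in docs)
        if df == 0:
            continue  # a term that occurs nowhere in the corpus cannot match
        idf = math.log((1 + len(docs)) / (1 + df)) + 1.0  # smoothed IDF
        weights[term] = freq * idf
    # Sort by descending weight, breaking ties alphabetically.
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

def more_like_this(seed, corpus, k=3):
    """Steps 4-5: OR the terms together and rank documents by summed boosts."""
    terms = top_terms(seed, corpus, k)
    scored = []
    for i, doc in enumerate(corpus):
        tokens = set(tokenize(doc))
        score = sum(w for t, w in terms if t in tokens)
        if score > 0:
            scored.append((i, score))
    return sorted(scored, key=lambda pair: -pair[1])

corpus = [
    "Lucene index tuning for fast search",
    "Baking bread at home",
    "Search performance in Lucene clusters",
]
seed = "Lucene search performance: measuring Lucene search latency"
ranking = more_like_this(seed, corpus)
```

With this toy corpus the third document ranks first because it matches all three selected terms, the first document follows with two matches, and the baking document is not returned at all.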

Variations include:

  • Term and document frequency thresholds: skip terms that are too rare or too common
  • Boost by TF-IDF: weight query terms by their discriminative power
  • Field-specific weighting: emphasize matches in title over body
  • Min/max term length filter: exclude very short or very long terms
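
Two of the variations above, frequency thresholds and field-specific weighting, can be sketched as follows. The parameter names (min_term_freq, min_doc_freq, max_doc_ratio), their default values, and the field_boosts mapping are hypothetical and chosen only for illustration; actual engines expose their own settings for these ideas.

```python
from collections import Counter

def filter_terms(seed_tokens, doc_freq, n_docs,
                 min_term_freq=2, min_doc_freq=2, max_doc_ratio=0.5):
    """Keep seed terms that recur in the seed and are neither rare nor ubiquitous."""
    tf = Counter(seed_tokens)
    kept = {}
    for term, freq in tf.items():
        df = doc_freq.get(term, 0)
        if freq < min_term_freq:
            continue  # appears too seldom in the seed document
        if df < min_doc_freq:
            continue  # too rare in the corpus (possibly a typo)
        if df > max_doc_ratio * n_docs:
            continue  # too common to discriminate between documents
        kept[term] = freq
    return kept

def field_score(query_terms, doc_fields, field_boosts):
    """Sum matching term weights per field, scaled by that field's boost."""
    score = 0.0
    for field, tokens in doc_fields.items():
        boost = field_boosts.get(field, 1.0)  # unlisted fields get a neutral boost
        present = set(tokens)
        score += boost * sum(w for t, w in query_terms.items() if t in present)
    return score

seed_tokens = ["lucene", "lucene", "search", "search", "the", "the", "the", "typo"]
doc_freq = {"lucene": 40, "search": 300, "the": 990, "typo": 1}
kept = filter_terms(seed_tokens, doc_freq, n_docs=1000)

title_hit = field_score({"lucene": 2.0},
                        {"title": ["lucene", "tips"], "body": ["lucene", "index"]},
                        {"title": 3.0})
body_hit = field_score({"lucene": 2.0},
                       {"title": ["cooking"], "body": ["lucene", "index"]},
                       {"title": 3.0})
```

In this example "the" is dropped for being too common and "typo" for appearing only once, while a document that matches in both title and body scores higher than one that matches in the body alone.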

[illustrate: Term extraction from seed document with TF-IDF weights; boosts applied to query terms]

Example

Seed document: paper on “Lucene search performance”

Top terms extracted: lucene (9.2), search (7.1), performance (8.5), index (6.3)

Query: lucene^9.2 OR search^7.1 OR performance^8.5 OR index^6.3

The query retrieves papers that contain any of these terms; papers matching more terms, or terms with higher boosts, rank higher.
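
As a small sketch of the query-construction step, the boosted OR query in this example can be assembled from the weighted terms as shown here. The helper name build_query is hypothetical, and the term^boost notation simply reproduces the query string shown in the example; terms containing special characters would need escaping, which this sketch omits.

```python
def build_query(weighted_terms):
    """Join (term, weight) pairs into a boolean OR query with per-term boosts."""
    return " OR ".join(f"{term}^{weight}" for term, weight in weighted_terms)

# Terms and weights taken from the example above.
terms = [("lucene", 9.2), ("search", 7.1), ("performance", 8.5), ("index", 6.3)]
query = build_query(terms)
print(query)
```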

Variants and history

MLT is a foundational technique in information retrieval and recommendation systems. It is used in Google “similar pages”, the Elasticsearch MLT API, and the Solr MoreLikeThis handler. Related approaches include semantic similarity (embeddings) and content-based filtering.

When to use it

Use MLT to recommend related articles, find duplicate documents, or implement content-based filtering. It is lightweight and fast, and it does not require external models. Because it matches only shared terms, it is generally less accurate than semantic similarity, but it is more interpretable and computationally cheaper.

See also