More Like This
What it is
More-like-this (MLT) analysis automatically builds a query from a seed document’s most prominent terms and uses it to retrieve similar documents. It extracts representative terms from the seed, weights them, and searches for other documents that contain them.
How it works
A typical MLT implementation proceeds in five steps:
- Extract tokens from the seed document
- Weight terms by TF-IDF or similar relevance metric
- Select top-k terms by weight (filtering stop words and short terms)
- Construct a query (e.g., boolean OR with term boosts)
- Execute query and rank results
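The first four steps can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular search engine: the stop-word list, the smoothed IDF formula, the tiny corpus, and the function names `tokenize`, `top_terms`, and `build_query` are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list for illustration; real analyzers ship larger ones.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to", "with"}

def tokenize(text, min_len=3):
    """Lowercase, split on non-alphanumerics, drop stop words and short terms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if len(t) >= min_len and t not in STOP_WORDS]

def top_terms(seed, corpus, k=4):
    """Return the k highest-weighted (term, tf-idf) pairs for the seed text."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))  # count each term once per document
    tf = Counter(tokenize(seed))
    weights = {
        # Smoothed IDF (an assumption) so unseen terms do not divide by zero.
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in tf.items()
    }
    # Highest weight first; ties broken alphabetically for a stable order.
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

def build_query(terms):
    """Join weighted terms into a boolean OR query with per-term boosts."""
    return " OR ".join(f"{term}^{weight:.1f}" for term, weight in terms)

# Hypothetical three-document corpus and seed text.
corpus = [
    "Lucene index tuning for search performance",
    "Cooking pasta with tomato sauce",
    "Search engines and ranking",
]
seed = "Lucene search performance: tuning the Lucene index"
print(build_query(top_terms(seed, corpus, k=3)))
```

In this toy corpus "search" appears in two of the three documents, so its IDF is lower and it drops out of the top three terms, while "lucene" ranks first because it occurs twice in the seed.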
Variations include:
- Frequency thresholds: skip terms that occur too rarely in the seed document, or that appear in too few or too many documents across the corpus
- Boost by TF-IDF: weight query terms by their discriminative power
- Field-specific weighting: emphasize matches in title over body
- Min/max term length filter: exclude very short or very long terms
[illustrate: Term extraction from seed document with TF-IDF weights; boosts applied to query terms]
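Several of these variations correspond to parameters of the Elasticsearch `more_like_this` query. The sketch below builds such a request body as a plain Python dict. The field names `title` and `body`, the threshold values, and the helper `mlt_clause` are assumptions chosen for illustration; field-specific weighting is approximated here by combining two clauses with different `boost` values, which is one possible approach rather than a prescribed one.

```python
import json

def mlt_clause(field, seed_text, boost):
    """Build one more_like_this clause for a single field (hypothetical helper)."""
    return {
        "more_like_this": {
            "fields": [field],
            "like": seed_text,
            "min_term_freq": 2,       # skip terms rare in the seed document
            "min_doc_freq": 5,        # skip terms rare across the corpus
            "max_doc_freq": 100000,   # skip terms that are too common
            "min_word_length": 3,     # exclude very short terms
            "max_word_length": 30,    # exclude very long terms
            "max_query_terms": 25,    # keep only the top-k terms
            "boost_terms": 1.0,       # boost selected terms by their TF-IDF score
            "boost": boost,           # weight of this clause as a whole
        }
    }

seed_text = "Lucene search performance"  # illustrative seed text

# Matches in the title count twice as much as matches in the body.
query = {
    "query": {
        "bool": {
            "should": [
                mlt_clause("title", seed_text, boost=2.0),
                mlt_clause("body", seed_text, boost=1.0),
            ]
        }
    }
}

print(json.dumps(query, indent=2))
```

The printed JSON could be sent as the body of a search request; suitable threshold values depend on corpus size and document length, so the numbers here should be treated as placeholders.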
Example
Seed document: paper on “Lucene search performance”
Top terms extracted: lucene (9.2), search (7.1), performance (8.5), index (6.3)
Query: lucene^9.2 OR search^7.1 OR performance^8.5 OR index^6.3
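The query retrieves papers that contain any of these terms; papers matching more of them, and especially the heavily boosted ones, rank higher.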
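To make the ranking step concrete, the sketch below scores a few documents against the example query. The scoring rule (sum the boosts of the query terms present in a document) is a deliberate simplification, not the similarity function a real engine would use, and the paper titles and the names `parse_query` and `score` are made up for the illustration.

```python
import re

def parse_query(query):
    """Parse 'term^boost OR term^boost' into a {term: boost} dict."""
    boosts = {}
    for clause in query.split(" OR "):
        term, boost = clause.split("^")
        boosts[term] = float(boost)
    return boosts

def score(doc, boosts):
    """Sum the boosts of the query terms that occur in the document."""
    tokens = set(re.findall(r"[a-z0-9]+", doc.lower()))
    return sum(boost for term, boost in boosts.items() if term in tokens)

query = "lucene^9.2 OR search^7.1 OR performance^8.5 OR index^6.3"

# Hypothetical paper titles.
papers = [
    "Measuring search latency",
    "Tuning Lucene index merges for search performance",
    "A history of typography",
]

boosts = parse_query(query)
# Keep only documents that match at least one term, best match first.
ranked = sorted(
    (p for p in papers if score(p, boosts) > 0),
    key=lambda p: score(p, boosts),
    reverse=True,
)
for paper in ranked:
    print(f"{score(paper, boosts):5.1f}  {paper}")
```

The paper that matches all four terms comes first, the one that matches only "search" comes second, and the unrelated paper is dropped because it matches nothing.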
Variants and history
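A foundational technique in information retrieval and content-based recommendation. Term-based MLT is implemented in Lucene’s MoreLikeThis class, the Elasticsearch more_like_this query, and the Solr MoreLikeThis handler; Google’s “similar pages” is a well-known feature of the same kind. Related approaches include embedding-based semantic similarity and other forms of content-based filtering.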
When to use it
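Use MLT to recommend related articles, find near-duplicate documents, or drive content-based filtering. It is lightweight and fast, and it needs no external model, only the term statistics already held in the search index. Because it matches on exact terms, it usually misses synonyms and paraphrases that embedding-based similarity would catch, but it is more interpretable (the query terms and boosts can be inspected) and cheaper to compute.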