Comparing BM25 and Dense Retrieval for a Product Catalogue

Introduction

A customer types “waterproof hiking boot size 10” into your store’s search box; a moment later they try “something to keep my feet dry on trails”. Both queries are after the same product, but they pull a keyword ranker in opposite directions.

This article puts BM25 and dense retrieval head-to-head on a product catalogue in OpenSearch. You will configure both retrieval strategies, run them against the same query set, measure where each one fails, and then combine them into a hybrid pipeline that is more robust across query types than either on its own. The implementation uses OpenSearch 2.13 with the all-MiniLM-L6-v2 sentence embedding model, registered on the OpenSearch ML node and run client-side via sentence-transformers for encoding.

Prerequisites

  • OpenSearch 2.13+ cluster with the ML plugin enabled (single-node dev cluster is fine)
  • Python 3.11+ with opensearch-py and sentence-transformers installed
  • Familiarity with OpenSearch index mappings and query DSL
  • A working understanding of BM25 scoring — see BM25 if you need a refresher

pip install opensearch-py sentence-transformers

Concept primer

BM25 and the vocabulary mismatch problem

BM25 scores documents by counting how often query terms appear in them, weighted by how rare those terms are across the corpus (IDF). It is exact-match at its core — a document only scores if it shares tokens with the query. This is why “waterproof hiking boot” matches well but “something to keep my feet dry on trails” does not: the query tokens have almost no overlap with typical product titles.

This failure mode is called vocabulary mismatch: two strings mean the same thing but share no words. It is endemic to product search because customers use natural descriptions while catalogues use brand and category terminology.
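
To make the failure mode concrete, here is a toy scorer using the standard BM25 formula. The IDF values and length parameters are made up for illustration; this is not the Lucene implementation OpenSearch uses:

def bm25_score(query: str, doc: str, idf: dict[str, float],
               avg_len: float = 12.0, k1: float = 1.2, b: float = 0.75) -> float:
    # Sum idf(term) * saturated term frequency for each query term found in the
    # document. A term that never occurs contributes nothing, so zero token
    # overlap means a zero score.
    doc_tokens = doc.lower().split()
    score = 0.0
    for term in query.lower().split():
        tf = doc_tokens.count(term)
        if tf == 0:
            continue
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_tokens) / avg_len))
        score += idf.get(term, 0.0) * norm
    return score

idf = {"waterproof": 2.1, "hiking": 1.8, "boot": 1.5}  # made-up corpus statistics
title = "Merrell Moab 3 Waterproof Hiking Boot"
print(bm25_score("waterproof hiking boot", title, idf))                    # > 0: tokens overlap
print(bm25_score("something to keep my feet dry on trails", title, idf))  # 0.0: no overlap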

Dense retrieval and semantic similarity

Dense retrieval encodes both queries and documents into fixed-length vectors using a neural encoder (typically a fine-tuned transformer). Relevance is measured as cosine similarity or dot product between vectors in that shared embedding space. Semantically related strings cluster together even when they share no tokens — “waterproof” and “keeps feet dry” end up close because the model has learned that they co-occur in similar contexts.
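
A quick sanity check with the same encoder used later in this article shows the effect. The exact similarity values will vary with the model version, but the paraphrase pair should score well above the unrelated pair:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
boots = model.encode("waterproof hiking boot", convert_to_tensor=True)
paraphrase = model.encode("something to keep my feet dry on trails", convert_to_tensor=True)
unrelated = model.encode("ultralight inflatable sleeping pad", convert_to_tensor=True)

print(util.cos_sim(boots, paraphrase).item())  # semantically close despite zero token overlap
print(util.cos_sim(boots, unrelated).item())   # noticeably lower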

The tradeoff: dense retrieval requires inference at query time, a vector index in addition to the inverted index, and a model that was trained on domain-relevant data. On exact product code queries (“B07XK2DKV3”) or highly specific technical specs (“M12 bolt 40mm”), BM25 wins cleanly because there is no ambiguity — token match is both necessary and sufficient.

Hybrid search runs both retrievers in parallel and combines their scores, typically with Reciprocal Rank Fusion (RRF) or a normalised linear combination. It captures the precision of BM25 on keyword queries and the recall of dense retrieval on semantic ones. OpenSearch 2.10+ ships a hybrid query clause that runs multiple sub-queries and combines their results through a search pipeline; RRF is one of the supported combination techniques in recent releases.

[illustrate: pipeline diagram — a query entering two parallel tracks labelled “BM25 / inverted index” and “kNN / vector index”, both producing ranked lists, then merging into a single ranked list via RRF; annotate with latency budget at each stage]


Step-by-step implementation

Step 1 — Create the product dataset

We use 500 synthetic product records in a single file. Each record has a title, description, and category. Save this as products.json — a snippet is shown below:

[
  {
    "id": "P001",
    "title": "Merrell Moab 3 Waterproof Hiking Boot",
    "description": "Full-grain leather and mesh upper with Vibram TC5+ outsole. Rated waterproof to 1 m submersion. Available in widths D and 2E.",
    "category": "footwear"
  },
  {
    "id": "P002",
    "title": "Columbia Newton Ridge Plus II Suede",
    "description": "Suede upper, Omni-Grip non-marking traction rubber. Water resistant, not waterproof. Lightweight and packable.",
    "category": "footwear"
  },
  {
    "id": "P003",
    "title": "Therm-a-Rest NeoAir XLite NXT Sleeping Pad",
    "description": "Ultralight inflatable pad, R-value 4.5, packs to 1 litre. Reflective ThermaCapture layer traps radiant heat.",
    "category": "sleep-systems"
  }
]

Step 2 — Set up the OpenSearch ML model

Register and deploy all-MiniLM-L6-v2 via the OpenSearch ML node. This model produces 384-dimensional vectors.

from opensearchpy import OpenSearch
import time

client = OpenSearch(
    hosts=[{"host": "localhost", "port": 9200}],
    http_auth=("admin", "admin"),
    use_ssl=False,
)

# Register the model from the ML Commons pretrained model hub.
# Registration is asynchronous: the call returns a task_id, and the model_id
# becomes available once the registration task completes.
register_body = {
    "name": "huggingface/sentence-transformers/all-MiniLM-L6-v2",
    "version": "1.0.1",  # check the ML Commons model hub for the current version
    "model_format": "TORCH_SCRIPT",
}
response = client.transport.perform_request(
    "POST", "/_plugins/_ml/models/_register", body=register_body
)
task_id = response["task_id"]

# Poll the registration task until it completes and a model_id is assigned
while True:
    task = client.transport.perform_request(
        "GET", f"/_plugins/_ml/tasks/{task_id}"
    )
    if task["state"] == "COMPLETED":
        model_id = task["model_id"]
        break
    time.sleep(5)

# Deploy (also asynchronous; this small model deploys in a few seconds)
client.transport.perform_request(
    "POST", f"/_plugins/_ml/models/{model_id}/_deploy"
)
print(f"Model deploy requested: {model_id}")

Keep the model_id value. This walkthrough encodes text client-side with sentence-transformers so every step is visible; the deployed model is what lets you move encoding server-side later (for example with a text_embedding ingest processor and the neural query type) without touching the index mapping. Version 1.0.1 was current as of OpenSearch 2.13. Check the ML Commons model hub for the latest published version before registering.
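
To confirm deployment finished, you can fetch the model’s state (a hedged check; field names follow the ML Commons model API):

info = client.transport.perform_request("GET", f"/_plugins/_ml/models/{model_id}")
print(info.get("model_state"))  # expect "DEPLOYED" once deployment has finished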

Step 3 — Create the index mapping

The index needs text fields for BM25 and a knn_vector field for dense retrieval. We concatenate title and description into a single search_text field, which serves as the source text for the embedding, and store the resulting vector in search_vector. BM25 queries title and description directly so that title matches can be boosted.

PUT /products
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100
    }
  },
  "mappings": {
    "properties": {
      "id":          { "type": "keyword" },
      "title":       { "type": "text", "analyzer": "english" },
      "description": { "type": "text", "analyzer": "english" },
      "category":    { "type": "keyword" },
      "search_text": {
        "type": "text",
        "analyzer": "english"
      },
      "search_vector": {
        "type": "knn_vector",
        "dimension": 384,
        "method": {
          "name": "hnsw",
          "space_type": "cosinesimil",
          "engine": "nmslib",
          "parameters": { "ef_construction": 128, "m": 24 }
        }
      }
    }
  }
}

Setting knn: true on the index enables vector indexing, and the hnsw method on the field builds the HNSW graph. ef_construction and m trade index build time and memory against recall; these values are reasonable defaults for a catalogue under 1 M documents. Malkov & Yashunin (2018), “Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs”, is the canonical reference for tuning these parameters to your recall and latency budget.
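
ef_search, set in the index settings above, controls how wide the graph search fans out at query time. Assuming your OpenSearch version treats it as a dynamic index setting (it is documented as such in recent releases), you can raise it later without reindexing if recall turns out to be insufficient:

PUT /products/_settings
{
  "index": { "knn.algo_param.ef_search": 200 }
}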

Step 4 — Index documents with embeddings

from sentence_transformers import SentenceTransformer
import json

model = SentenceTransformer("all-MiniLM-L6-v2")

with open("products.json") as f:
    products = json.load(f)

actions = []
for product in products:
    search_text = f"{product['title']} {product['description']}"
    vector = model.encode(search_text).tolist()

    doc = {
        **product,
        "search_text": search_text,
        "search_vector": vector,
    }
    actions.append({"index": {"_index": "products", "_id": product["id"]}})
    actions.append(doc)

# Bulk index 100 documents per request; each document contributes two action
# lines (metadata + source), hence the slice step of 200.
batch_size = 200
for i in range(0, len(actions), batch_size):
    response = client.bulk(body=actions[i : i + batch_size])
    if response.get("errors"):
        raise RuntimeError(f"Bulk indexing reported errors in batch starting at {i}")

print(f"Indexed {len(products)} products")

Step 5 — BM25 retrieval

A standard multi-match query across title (boosted) and description:

def search_bm25(query: str, size: int = 10) -> list[dict]:
    body = {
        "size": size,
        "query": {
            "multi_match": {
                "query": query,
                "fields": ["title^3", "description"],
                "type": "best_fields",
                "operator": "or",
            }
        },
        "_source": ["id", "title", "category"],
    }
    response = client.search(index="products", body=body)
    return response["hits"]["hits"]

The ^3 boost on title reflects the intuition that a term appearing in a product title is a stronger relevance signal than the same term buried in a description paragraph.
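
A quick smoke test (results depend on your catalogue, but the boot should rank at or near the top):

for hit in search_bm25("waterproof hiking boot size 10"):
    print(round(hit["_score"], 2), hit["_source"]["title"])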

Step 6 — Dense retrieval (kNN)

Encode the query with the same model used at index time, then run an approximate nearest-neighbour search:

def search_dense(query: str, size: int = 10) -> list[dict]:
    query_vector = model.encode(query).tolist()
    body = {
        "size": size,
        "query": {
            "knn": {
                "search_vector": {
                    "vector": query_vector,
                    "k": size,
                }
            }
        },
        "_source": ["id", "title", "category"],
    }
    response = client.search(index="products", body=body)
    return response["hits"]["hits"]
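
Running the vocabulary-mismatch query from the introduction through both retrievers makes the contrast visible. With the sample data, BM25 typically finds little or nothing relevant while the kNN search surfaces the boots:

query = "something to keep my feet dry on trails"
print("BM25: ", [hit["_source"]["title"] for hit in search_bm25(query, size=3)])
print("Dense:", [hit["_source"]["title"] for hit in search_dense(query, size=3)])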

Step 7 — Hybrid retrieval with RRF

OpenSearch hybrid queries run multiple sub-queries and pipe their results through a phase-results processor, defined in a search pipeline, that normalises and combines the per-retriever scores. The pipeline below configures RRF combination; the supported processor names and combination techniques have evolved across releases, so check the hybrid search documentation for your cluster version.

First, create the search pipeline:

PUT /_search/pipeline/hybrid-rrf
{
  "description": "BM25 + kNN with RRF combination",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": { "technique": "min_max" },
        "combination": {
          "technique": "rrf",
          "parameters": { "rank_constant": 60 }
        }
      }
    }
  ]
}

Then query using both sub-queries inside a hybrid clause:

def search_hybrid(query: str, size: int = 10) -> list[dict]:
    query_vector = model.encode(query).tolist()
    body = {
        "size": size,
        "query": {
            "hybrid": {
                "queries": [
                    {
                        "multi_match": {
                            "query": query,
                            "fields": ["title^3", "description"],
                            "type": "best_fields",
                        }
                    },
                    {
                        "knn": {
                            "search_vector": {
                                "vector": query_vector,
                                "k": size,
                            }
                        }
                    },
                ]
            }
        },
        "_source": ["id", "title", "category"],
    }
    response = client.search(
        index="products",
        body=body,
        params={"search_pipeline": "hybrid-rrf"},
    )
    return response["hits"]["hits"]

A rank_constant of 60 matches the value used in the original RRF paper (Cormack, Clarke & Büttcher, 2009) and is the usual default. Increasing it flattens the score differences between ranks; decreasing it sharpens them.


How it works

Why BM25 fails on natural-language descriptions

BM25 builds a score from shared tokens. “Something to keep my feet dry on trails” contains none of the tokens in “Merrell Moab 3 Waterproof Hiking Boot” — so the score is zero and the product does not appear in results at all. The inverted index is only consulted for tokens that exist; there is no mechanism to bridge the vocabulary gap.

[illustrate: before/after query transformation — left column shows raw query “something to keep my feet dry on trails” with zero-overlap tokens highlighted against a product title; right column shows the same query encoded to a 384-dim vector and plotted in 2D UMAP projection alongside 5 candidate product vectors, with the correct product circled]

Why dense retrieval fails on exact specifications

When a customer queries “M12 hex bolt 40mm stainless A4-70”, the embedding model averages subword representations into a single vector. Subtle differences in specs (“A4-70” vs “A2-70”) compress into nearby but not identical points, and approximate nearest-neighbour search with HNSW may return the wrong grade. BM25’s exact token match on “A4-70” is decisive here.

Dense retrieval also suffers from index-query distribution shift: if the model was not trained on product catalogue text, embeddings for product codes, brand names, and material specifications may not cluster meaningfully. Thakur et al. (BEIR, 2021) benchmark this effect systematically across 18 domain-shifted corpora — a useful reference when evaluating whether a general-purpose encoder will transfer to your catalogue.

How RRF combines the rankings

RRF converts each result list into a reciprocal rank score — 1 / (rank + k) — then sums the scores across lists. A document ranked 1st in BM25 and 2nd in kNN will comfortably outscore a document ranked 1st in one list but absent from the other. This is not simple re-scoring — it rewards consistent agreement across retrievers.

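As a concrete sketch of the formula (not the OpenSearch internals), here is RRF applied to two small ranked lists with the same rank constant as the pipeline above; the document IDs are hypothetical:

def rrf_fuse(ranked_lists: list[list[str]], rank_constant: int = 60) -> list[tuple[str, float]]:
    # Sum 1 / (rank_constant + rank) for every list a document appears in
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

bm25_ranking = ["P001", "P007", "P002"]  # ranks 1-3 from BM25
knn_ranking  = ["P002", "P001", "P009"]  # ranks 1-3 from kNN

print(rrf_fuse([bm25_ranking, knn_ranking]))
# P001 and P002, which appear in both lists, outscore P007 and P009, which appear in only one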

[illustrate: RRF merge — two ranked lists of 5 items each side by side, with arrows showing how items shared between lists have their reciprocal rank scores summed, and the merged ranked list on the right showing re-ordered results]


Testing and validation

Build an evaluation query set

Create 30–50 manually labelled queries covering three types:

Type                           Example                                              Expected winner
Exact keyword                  “Merrell Moab 3”                                     BM25
Semantic / natural language    “something to keep my feet dry on trails”            Dense
Specification with context    “lightweight sleeping pad under 500g warm weather”    Hybrid

eval_queries = [
    {
        "query": "Merrell Moab 3",
        "relevant_ids": ["P001"],
        "type": "keyword",
    },
    {
        "query": "something to keep my feet dry on trails",
        "relevant_ids": ["P001", "P002"],
        "type": "semantic",
    },
    {
        "query": "lightweight sleeping pad under 500g warm weather",
        "relevant_ids": ["P003"],
        "type": "spec-with-context",
    },
]

Measure Precision@5

def precision_at_k(results: list[dict], relevant_ids: set[str], k: int = 5) -> float:
    top_k_ids = {hit["_source"]["id"] for hit in results[:k]}
    return len(top_k_ids & relevant_ids) / k

for q in eval_queries:
    relevant = set(q["relevant_ids"])
    bm25_results  = search_bm25(q["query"])
    dense_results = search_dense(q["query"])
    hybrid_results = search_hybrid(q["query"])

    print(f"Query: {q['query']!r} [{q['type']}]")
    print(f"  BM25   P@5: {precision_at_k(bm25_results,  relevant):.2f}")
    print(f"  Dense  P@5: {precision_at_k(dense_results, relevant):.2f}")
    print(f"  Hybrid P@5: {precision_at_k(hybrid_results, relevant):.2f}")

Expected result pattern

On a well-labelled product catalogue with the setup above, you should observe a pattern roughly like this:

Query type           BM25 P@5   Dense P@5   Hybrid P@5
Exact keyword        0.90       0.55        0.85
Semantic             0.15       0.75        0.72
Spec with context    0.60       0.65        0.78

BM25 is dominant on exact keyword queries but falls sharply on semantic ones. Dense retrieval recovers most semantic queries but underperforms on exact matches. Hybrid rarely tops either specialist but is the most consistent across types — which is what matters in a production catalogue where query intent is unknown at serving time.


Tradeoffs and alternatives

BM25 alone is the right choice if your catalogue is highly keyword-structured (SKUs, part numbers, well-controlled terminology) and query traffic is predictable. It requires no GPU and no embedding inference, and index updates are cheap: a product added to the catalogue is searchable after the next index refresh, with no vector to compute.

Dense retrieval alone makes sense when your queries are almost entirely natural language (e.g., a conversational product discovery interface) and you have the infrastructure to run embedding inference reliably. It requires keeping the query-side model in sync with the index-side model — a model upgrade means re-indexing the entire catalogue.

Hybrid search is the pragmatic default for general-purpose product search. The main costs are:

  • Additional index storage for vectors: 384 floats × 4 bytes per document, plus HNSW graph overhead (a worked example follows this list)
  • Query latency increases slightly: the kNN and BM25 sub-queries run in parallel, but the slower of the two sets the overall response time
  • The search pipeline adds a small coordination overhead
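
As a rough sizing example: 384 dimensions × 4 bytes ≈ 1.5 KB of raw vector data per document, so the 500-product demo adds well under 1 MB, while a 1 M-product catalogue adds roughly 1.5 GB before HNSW graph overhead and replicas.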

Alternatives worth considering:

  • BM25 + query expansion — using a synonym map or LLM-generated query rewrites to bridge vocabulary mismatch without adding a vector index. Lower infrastructure cost, lower recall ceiling.
  • Learned sparse retrieval (SPLADE, OpenSearch Neural Sparse) — a middle ground that learns a sparse representation over the vocabulary, avoiding the vocabulary mismatch problem while staying in the inverted index paradigm. Available in OpenSearch 2.11+ as neural_sparse query type.
  • Re-ranking — run BM25 as a first-stage retriever to get a candidate set of 100–500 documents, then apply a cross-encoder or LLM-based re-ranker over that set. Substantially better quality than bi-encoder dense retrieval at the cost of higher per-query compute.

Further reading

  • BM25 — full explanation of the scoring formula and its parameters
  • Inverted Index — the data structure BM25 queries against
  • Query Expansion — a lighter-weight alternative to dense retrieval for vocabulary mismatch
  • Token — how text is normalised before it reaches the BM25 scorer