Embeddings
-
Comparing BM25 and Dense Retrieval for a Product Catalogue
A side-by-side evaluation of keyword search and embedding-based search on a realistic product dataset, showing where each approach wins and how hybrid search splits the difference.
-
Unigram Language Model Tokeniser
The Unigram LM tokeniser builds a subword vocabulary top-down: it begins with a large candidate set and iteratively prunes the entries whose removal least increases the corpus log-loss, producing a probability distribution over segmentations.
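As a rough illustration of the "distribution over segmentations" idea, the sketch below runs Viterbi decoding over a tiny hand-made unigram vocabulary (the pieces and their probabilities are invented, not learned) to pick the most probable segmentation of a string; the real tokeniser learns these probabilities with EM and performs the pruning described above.

```python
import math

# Toy unigram vocabulary with made-up probabilities; a real Unigram LM
# tokeniser learns these via EM and prunes low-contribution entries.
vocab = {
    "▁hug": 0.15, "▁hu": 0.04, "▁h": 0.05, "h": 0.03, "u": 0.03,
    "ug": 0.10, "g": 0.05, "gs": 0.06, "s": 0.08, "▁": 0.02,
}

def viterbi_segment(text: str) -> list[str]:
    """Return the most probable segmentation of `text` under the unigram model."""
    n = len(text)
    # best[i] = (log-prob of best segmentation of text[:i], start of its last token)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in vocab and best[start][0] > -math.inf:
                score = best[start][0] + math.log(vocab[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    # Backtrack from the end to recover the token boundaries.
    tokens, i = [], n
    while i > 0:
        start = best[i][1]
        tokens.append(text[start:i])
        i = start
    return list(reversed(tokens))

print(viterbi_segment("▁hugs"))  # ['▁hug', 's'] under the toy probabilities
```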
-
SentencePiece
SentencePiece is a language-agnostic subword tokeniser that trains directly on raw Unicode text, encodes whitespace as the ▁ symbol, and produces a fully reversible token sequence using either BPE or Unigram LM as the underlying algorithm.
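The whitespace convention is easy to illustrate without the library itself: a minimal sketch, assuming only that spaces are rewritten as ▁ before segmentation and that detokenisation is plain concatenation plus the reverse substitution (the fixed-width chunking below just stands in for a trained BPE or Unigram segmentation).

```python
WS = "\u2581"  # the ▁ metasymbol

def encode_whitespace(text: str) -> str:
    # Mark word boundaries explicitly so no information is lost.
    return text.replace(" ", WS)

def decode(tokens: list[str]) -> str:
    # Detokenisation is concatenation plus the reverse substitution.
    return "".join(tokens).replace(WS, " ")

text = "new york city"
marked = encode_whitespace(text)                              # 'new▁york▁city'
tokens = [marked[i:i + 4] for i in range(0, len(marked), 4)]  # stand-in segmentation
assert decode(tokens) == text                                 # round trip is lossless
print(tokens)
```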
-
WordPiece
WordPiece is a subword tokenisation algorithm that builds a vocabulary by iteratively merging symbol pairs chosen to maximise training-corpus likelihood, rather than raw frequency. It is the tokeniser used in BERT and its derivatives.
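A toy sketch of the likelihood-motivated scoring rule commonly used to describe WordPiece training, score(a, b) = count(ab) / (count(a) * count(b)); the corpus frequencies are invented. Note that the winning pair here is not the most frequent one, which is exactly the difference from plain frequency-based merging.

```python
from collections import Counter

# Toy corpus: word -> frequency. Each word starts segmented into characters,
# with the "##" continuation prefix on non-initial symbols, as in BERT.
corpus = {"hug": 10, "pug": 5, "hugs": 3, "bun": 4}
splits = {w: [w[0]] + ["##" + c for c in w[1:]] for w in corpus}

def pair_scores(splits, corpus):
    """WordPiece-style scores: count(ab) / (count(a) * count(b))."""
    symbol_counts, pair_counts = Counter(), Counter()
    for word, freq in corpus.items():
        syms = splits[word]
        for s in syms:
            symbol_counts[s] += freq
        for a, b in zip(syms, syms[1:]):
            pair_counts[(a, b)] += freq
    return {
        pair: count / (symbol_counts[pair[0]] * symbol_counts[pair[1]])
        for pair, count in pair_counts.items()
    }

scores = pair_scores(splits, corpus)
best = max(scores, key=scores.get)
# ('##g', '##s') wins despite a low raw count, because both symbols are rare.
print(best, scores[best])
```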
-
Byte Pair Encoding
Byte pair encoding is a data-compression algorithm repurposed for NLP to build subword vocabularies by iteratively merging the most frequent adjacent symbol pair in a training corpus.
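A minimal sketch of the merge loop (word frequencies invented, end-of-word markers omitted for brevity): count every adjacent symbol pair across the corpus, merge the most frequent one everywhere, and repeat.

```python
from collections import Counter

def bpe_merges(corpus: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules by repeatedly merging the most frequent adjacent pair."""
    splits = {w: list(w) for w in corpus}  # start from single characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for word, freq in corpus.items():
            syms = splits[word]
            for pair in zip(syms, syms[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merges.append(best)
        # Apply the merge to every word's current segmentation.
        for word, syms in splits.items():
            merged, i = [], 0
            while i < len(syms):
                if i + 1 < len(syms) and (syms[i], syms[i + 1]) == best:
                    merged.append(syms[i] + syms[i + 1])
                    i += 2
                else:
                    merged.append(syms[i])
                    i += 1
            splits[word] = merged
    return merges

# Toy word frequencies, made up for illustration.
print(bpe_merges({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 5))
```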
-
Subword Tokenisation
Subword tokenisation splits words into smaller vocabulary units — fragments between characters and whole words — so a fixed vocabulary can represent any input string, including words never seen during training.
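To make the "any input string" point concrete, here is a greedy longest-match sketch over a made-up vocabulary, with a per-character fallback standing in for the base alphabet that real tokenisers include; "tokenisers" is covered even though it never appears in the vocabulary as a whole word.

```python
def greedy_tokenise(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match subword tokenisation with character fallback,
    so any string can be covered by a fixed vocabulary."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest vocabulary entry starting at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Unseen character: fall back to a single symbol (real tokenisers
            # include the full base alphabet or raw bytes in the vocabulary).
            tokens.append(word[i])
            i += 1
    return tokens

# Toy vocabulary; "tokenisers" was never seen as a whole word.
vocab = {"token", "is", "er", "s", "ise", "isers"}
print(greedy_tokenise("tokenisers", vocab))  # ['token', 'isers']
```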
-
Skip-Gram
A skip-gram is a generalisation of the n-gram that allows gaps between tokens, and also the name of the Word2Vec training objective that predicts context words from a centre word.
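Both senses are easy to show in a few lines; the k-skip-n-gram definition below (at most k skipped tokens in total) is one common variant, and the window size is arbitrary.

```python
from itertools import combinations

def k_skip_n_grams(tokens: list[str], n: int, k: int) -> list[tuple[str, ...]]:
    """n-grams that may skip up to k tokens in total between their elements."""
    grams = []
    for idxs in combinations(range(len(tokens)), n):
        if idxs[-1] - idxs[0] <= (n - 1) + k:  # total gap bounded by k
            grams.append(tuple(tokens[i] for i in idxs))
    return grams

def word2vec_pairs(tokens: list[str], window: int) -> list[tuple[str, str]]:
    """(centre, context) training pairs for the skip-gram objective."""
    pairs = []
    for i, centre in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

sent = "the cat sat on the mat".split()
print(k_skip_n_grams(sent, n=2, k=1))  # bigrams allowing one skipped token
print(word2vec_pairs(sent, window=2))  # skip-gram (centre, context) pairs
```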