Embeddings
-
Comparing BM25 and Dense Retrieval for a Product Catalogue
A side-by-side evaluation of keyword search and embedding-based search on a realistic product dataset, showing where each approach wins and how hybrid search splits the difference.
-
Unigram Language Model Tokeniser
The Unigram LM tokeniser builds a subword vocabulary top-down: it begins with a large candidate set and iteratively prunes the entries whose removal least increases the corpus log-loss, producing a probability distribution over segmentations.
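As a rough illustration of the "distribution over segmentations" idea, the sketch below runs Viterbi decoding over a tiny hand-made unigram vocabulary (the pieces and their probabilities are invented, not learned) to pick the most probable segmentation of a string; the real tokeniser learns these probabilities with EM and performs the pruning described above.

```python
import math

# Toy unigram vocabulary with made-up probabilities; a real Unigram LM
# tokeniser learns these via EM and prunes low-contribution entries.
vocab = {
    "▁hug": 0.15, "▁hu": 0.04, "▁h": 0.05, "h": 0.03, "u": 0.03,
    "ug": 0.10, "g": 0.05, "gs": 0.06, "s": 0.08, "▁": 0.02,
}

def viterbi_segment(text: str) -> list[str]:
    """Return the most probable segmentation of `text` under the unigram model."""
    n = len(text)
    # best[i] = (log-prob of best segmentation of text[:i], start of its last token)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in vocab and best[start][0] > -math.inf:
                score = best[start][0] + math.log(vocab[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    # Backtrack from the end to recover the token boundaries.
    tokens, i = [], n
    while i > 0:
        start = best[i][1]
        tokens.append(text[start:i])
        i = start
    return list(reversed(tokens))

print(viterbi_segment("▁hugs"))  # ['▁hug', 's'] under the toy probabilities
```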
-
SentencePiece
SentencePiece is a language-agnostic subword tokeniser that trains directly on raw Unicode text, encodes whitespace as the ▁ symbol, and produces a fully reversible token sequence using either BPE or Unigram LM as the underlying algorithm.
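The whitespace convention is easy to illustrate without the library itself: a minimal sketch, assuming only that spaces are rewritten as ▁ before segmentation and that detokenisation is plain concatenation plus the reverse substitution (the fixed-width chunking below just stands in for a trained BPE or Unigram segmentation).

```python
WS = "\u2581"  # the ▁ metasymbol

def encode_whitespace(text: str) -> str:
    # Mark word boundaries explicitly so no information is lost.
    return text.replace(" ", WS)

def decode(tokens: list[str]) -> str:
    # Detokenisation is concatenation plus the reverse substitution.
    return "".join(tokens).replace(WS, " ")

text = "new york city"
marked = encode_whitespace(text)                              # 'new▁york▁city'
tokens = [marked[i:i + 4] for i in range(0, len(marked), 4)]  # stand-in segmentation
assert decode(tokens) == text                                 # round trip is lossless
print(tokens)
```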
-
WordPiece
WordPiece is a subword tokenisation algorithm that builds a vocabulary by iteratively merging symbol pairs chosen to maximise training-corpus likelihood, rather than raw frequency. It is the tokeniser used in BERT and its derivatives.
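A toy sketch of the likelihood-motivated scoring rule commonly used to describe WordPiece training, score(a, b) = count(ab) / (count(a) * count(b)); the corpus frequencies are invented. Note that the winning pair here is not the most frequent one, which is exactly the difference from plain frequency-based merging.

```python
from collections import Counter

# Toy corpus: word -> frequency. Each word starts segmented into characters,
# with the "##" continuation prefix on non-initial symbols, as in BERT.
corpus = {"hug": 10, "pug": 5, "hugs": 3, "bun": 4}
splits = {w: [w[0]] + ["##" + c for c in w[1:]] for w in corpus}

def pair_scores(splits, corpus):
    """WordPiece-style scores: count(ab) / (count(a) * count(b))."""
    symbol_counts, pair_counts = Counter(), Counter()
    for word, freq in corpus.items():
        syms = splits[word]
        for s in syms:
            symbol_counts[s] += freq
        for a, b in zip(syms, syms[1:]):
            pair_counts[(a, b)] += freq
    return {
        pair: count / (symbol_counts[pair[0]] * symbol_counts[pair[1]])
        for pair, count in pair_counts.items()
    }

scores = pair_scores(splits, corpus)
best = max(scores, key=scores.get)
# ('##g', '##s') wins despite a low raw count, because both symbols are rare.
print(best, scores[best])
```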
-
Byte Pair Encoding
Byte pair encoding is a data-compression algorithm repurposed for NLP to build subword vocabularies by iteratively merging the most frequent adjacent symbol pair in a training corpus.
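A minimal sketch of the merge loop (word frequencies invented, end-of-word markers omitted for brevity): count every adjacent symbol pair across the corpus, merge the most frequent one everywhere, and repeat.

```python
from collections import Counter

def bpe_merges(corpus: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules by repeatedly merging the most frequent adjacent pair."""
    splits = {w: list(w) for w in corpus}  # start from single characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for word, freq in corpus.items():
            syms = splits[word]
            for pair in zip(syms, syms[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merges.append(best)
        # Apply the merge to every word's current segmentation.
        for word, syms in splits.items():
            merged, i = [], 0
            while i < len(syms):
                if i + 1 < len(syms) and (syms[i], syms[i + 1]) == best:
                    merged.append(syms[i] + syms[i + 1])
                    i += 2
                else:
                    merged.append(syms[i])
                    i += 1
            splits[word] = merged
    return merges

# Toy word frequencies, made up for illustration.
print(bpe_merges({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 5))
```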
-
Subword Tokenisation
Subword tokenisation splits words into smaller vocabulary units — fragments between characters and whole words — so a fixed vocabulary can represent any input string, including words never seen during training.
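To make the "any input string" point concrete, here is a greedy longest-match sketch over a made-up vocabulary, with a per-character fallback standing in for the base alphabet that real tokenisers include; "tokenisers" is covered even though it never appears in the vocabulary as a whole word.

```python
def greedy_tokenise(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match subword tokenisation with character fallback,
    so any string can be covered by a fixed vocabulary."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest vocabulary entry starting at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Unseen character: fall back to a single symbol (real tokenisers
            # include the full base alphabet or raw bytes in the vocabulary).
            tokens.append(word[i])
            i += 1
    return tokens

# Toy vocabulary; "tokenisers" was never seen as a whole word.
vocab = {"token", "is", "er", "s", "ise", "isers"}
print(greedy_tokenise("tokenisers", vocab))  # ['token', 'isers']
```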
-
Skip-Gram
A skip-gram is a generalisation of the n-gram that allows gaps between tokens, and also the name of the Word2Vec training objective that predicts context words from a centre word.
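Both senses are easy to show in a few lines; the k-skip-n-gram definition below (at most k skipped tokens in total) is one common variant, and the window size is arbitrary.

```python
from itertools import combinations

def k_skip_n_grams(tokens: list[str], n: int, k: int) -> list[tuple[str, ...]]:
    """n-grams that may skip up to k tokens in total between their elements."""
    grams = []
    for idxs in combinations(range(len(tokens)), n):
        if idxs[-1] - idxs[0] <= (n - 1) + k:  # total gap bounded by k
            grams.append(tuple(tokens[i] for i in idxs))
    return grams

def word2vec_pairs(tokens: list[str], window: int) -> list[tuple[str, str]]:
    """(centre, context) training pairs for the skip-gram objective."""
    pairs = []
    for i, centre in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

sent = "the cat sat on the mat".split()
print(k_skip_n_grams(sent, n=2, k=1))  # bigrams allowing one skipped token
print(word2vec_pairs(sent, window=2))  # skip-gram (centre, context) pairs
```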