NLP Citations
The algorithms behind search and language — explained visually
NLP and information retrieval are full of ideas that are easy to state but hard to hold in your head. This reference covers the terms, techniques, and papers that practising engineers actually encounter — with step-through animations built for the moment an algorithm finally clicks.
An inverted index merge, a dynamic programming traceback, a sliding n-gram window — each one clicks into place the moment you see it move. Every concept that benefits from animation has a step-through you can pause, rewind, and replay.
Written for engineers building search systems, working with language models, or filling the gaps that most NLP courses skip over.
Featured concepts
BM25
BM25 (Best Match 25) is a probabilistic ranking function that scores documents against a query by weighing term frequency and inverse document frequency with length normalisation.
interactive formula

Similarity
Levenshtein Distance
Edit distance allowing insertions, deletions, and substitutions. Canonical metric for string similarity and typo tolerance.
step-through animation

Retrieval
Postings List
A postings list is the ordered sequence of postings for a single term in an inverted index — the list of all documents containing that term, with optional frequencies and positions.
step-through animation

Similarity
Cosine Similarity
Vector similarity metric: the dot product of two vectors divided by the product of their magnitudes. Standard measure for comparing dense and sparse vectors in IR.
interactive formula

Tokenisation
N-Gram
An n-gram is a contiguous sequence of n tokens drawn from a text, used to capture local word order for indexing, language modelling, and similarity.
step-through animation

Postings list merge
An inverted index stores, for each term, a sorted list of the documents it appears in — together with positions when phrase search is needed. Executing a query means walking two or more of those lists simultaneously, advancing whichever pointer is behind. Step through the animation to see how an AND merge resolves a phrase query: the two cursors leap-frog through the postings until they land on the same document at adjacent positions, emitting a match only when both conditions hold.
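The leap-frog merge described above can be sketched in a few lines. The postings layout here (a dict mapping each document id to a sorted list of term positions) is an assumption for illustration, not a prescribed index format:

```python
# Sketch of a positional AND merge for a two-term phrase query.
# Assumed postings format: {doc_id: [positions...]}, with doc ids walked in sorted order.

def phrase_merge(postings_a, postings_b):
    """Return doc ids where term B appears immediately after term A."""
    docs_a = sorted(postings_a)
    docs_b = sorted(postings_b)
    i = j = 0
    matches = []
    while i < len(docs_a) and j < len(docs_b):
        if docs_a[i] < docs_b[j]:
            i += 1          # cursor A is behind: advance it
        elif docs_b[j] < docs_a[i]:
            j += 1          # cursor B is behind: advance it
        else:
            doc = docs_a[i]  # both cursors on the same document
            pos_b = set(postings_b[doc])
            # phrase condition: some position p of A has p + 1 in B
            if any(p + 1 in pos_b for p in postings_a[doc]):
                matches.append(doc)
            i += 1
            j += 1
    return matches

index_a = {1: [3, 9], 4: [0], 7: [5]}   # postings for the first term
index_b = {1: [4], 4: [7], 7: [2]}      # postings for the second term
print(phrase_merge(index_a, index_b))   # → [1]: only doc 1 has adjacent positions (3, 4)
```

Note that the emit step only fires when both cursors land on the same document, exactly as in the animation; the positional check is what turns a plain AND merge into a phrase match.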
Sliding window tokenisation
Many NLP tasks — fuzzy matching, near-duplicate detection, language identification — rely on breaking text into fixed-size character or word sequences. A window of width n moves one step at a time across the input, and every position produces one token. Step through the animation to watch each trigram peel off the word “colour” and see how overlapping windows capture every local context.
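The sliding window is a one-liner in practice; this character-trigram sketch (function name illustrative) mirrors the animation:

```python
def char_ngrams(text, n=3):
    """Slide a width-n window across text, one step at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("colour"))  # → ['col', 'olo', 'lou', 'our']
```

The same window over a list of words rather than characters yields word n-grams; only the sequence type changes.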
Levenshtein edit distance
Measuring how similar two strings are — for spell correction, record linkage, or fuzzy search — comes down to counting the cheapest sequence of single-character edits that transforms one into the other. The Levenshtein algorithm fills a dynamic programming matrix one cell at a time, each cell recording the minimum cost to align the prefixes that meet at that corner. Step through to watch the matrix fill, then follow the highlighted traceback path from the bottom-right back to the origin to read off exactly which insertions, deletions, and substitutions were chosen.
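The fill-then-traceback procedure above can be sketched as follows; the edit labels and tie-breaking order (diagonal first) are illustrative choices, since several optimal paths may exist:

```python
def levenshtein(a, b):
    """Fill the DP matrix, then trace back to list the edits chosen."""
    m, n = len(a), len(b)
    # d[i][j] = minimum edits to turn a[:i] into b[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    # traceback: walk from the bottom-right corner to the origin
    edits, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            if a[i - 1] != b[j - 1]:
                edits.append(f"substitute {a[i - 1]} -> {b[j - 1]}")
            i, j = i - 1, j - 1          # diagonal move: match or substitution
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            edits.append(f"delete {a[i - 1]}")
            i -= 1                       # vertical move: deletion
        else:
            edits.append(f"insert {b[j - 1]}")
            j -= 1                       # horizontal move: insertion
    return d[m][n], list(reversed(edits))

dist, ops = levenshtein("kitten", "sitting")
print(dist)  # → 3
```

Every diagonal step with equal characters costs nothing and records no edit, which is why the traceback emits exactly `dist` operations.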