Corpus
What it is
A corpus (plural: corpora) is any collection of text assembled for use by an NLP system. The term is broad by design: a corpus might be a folder of PDFs on a local machine, the roughly six million articles of English Wikipedia, a billion web-crawled documents, or a handful of customer support transcripts. What makes a collection of text a corpus is purpose — it has been gathered to serve as the input for indexing, training, evaluation, or statistical analysis.
Developers encounter corpora in two distinct roles:
- As the thing being searched. In a search engine, the corpus is the set of documents the index is built over. BM25 computes IDF values relative to this corpus; adding or removing documents changes those statistics.
- As training data. A language model, word embedding, or classifier is trained on a corpus. The vocabulary it learns, the associations it encodes, and the biases it carries all originate here.
The distinction matters because the same corpus can play both roles simultaneously — and because the properties of the corpus (size, domain, language balance, noise level) determine the ceiling on system quality.
How it works
A corpus is not processed as a whole in one pass. Systems interact with it through a pipeline:
- Collection — documents are gathered from one or more sources (web crawl, database export, file system, streaming API).
- Normalisation — encoding is standardised (UTF-8), boilerplate stripped, and duplicate or near-duplicate documents removed.
- Annotation (optional) — documents may be tagged with metadata: language, source, date, label, or quality score.
- Ingestion — the normalised documents are fed to an indexing or training pipeline, which tokenises each document and derives statistics (term frequencies, co-occurrence counts, embedding weights) from the resulting token streams.
The corpus is the fixed input; everything downstream is a function of it. This means that a mismatch between the corpus your model was trained on and the text it encounters at inference time — called a domain shift — is one of the most common causes of degraded NLP system performance.
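For example, here is a minimal sketch of the normalisation, deduplication, and ingestion stages in Python. Everything in it is illustrative: the helper names are not a standard API, and the whitespace tokeniser stands in for a real one.

```python
import hashlib
import unicodedata
from collections import Counter

def normalise(text: str) -> str:
    # Standardise the Unicode form and collapse whitespace; a real pipeline
    # would also strip boilerplate (navigation, headers, footers) here.
    return " ".join(unicodedata.normalize("NFC", text).split())

def deduplicate(docs):
    # Drop exact duplicates by hashing the normalised text.
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def ingest(corpus):
    # Derive corpus-level statistics from the token streams:
    # per-term document frequency and average document length (avgdl).
    df, total_tokens = Counter(), 0
    for doc in corpus:
        tokens = doc.lower().split()   # naive whitespace tokeniser
        total_tokens += len(tokens)
        df.update(set(tokens))         # count each term once per document
    avgdl = total_tokens / len(corpus) if corpus else 0.0
    return df, avgdl

raw_docs = [
    "The  cache must be warmed   before use.",
    "The cache must be warmed before use.",   # duplicate once normalised
    "SAML assertions carry the user identity.",
]
corpus = deduplicate(normalise(d) for d in raw_docs)
df, avgdl = ingest(corpus)
print(len(corpus), avgdl)   # 2 documents remain, avgdl == 6.5
```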
[illustrate: pipeline diagram — raw document collection enters left, passes through normalisation and deduplication, splits into two paths: one into an inverted index (search corpus), one into a training loop (model corpus); statistics such as IDF and vocabulary are labelled at the output of each path]
Example
A developer builds an internal documentation search engine. The corpus is 4 000 Markdown files exported from Confluence.
| Property | Value |
|---|---|
| Documents | 4 000 |
| Vocabulary size (after tokenisation) | ~42 000 unique terms |
| Average document length | 380 tokens |
| Language | English, with some product-specific jargon |
BM25’s avgdl parameter (average document length) is computed over this corpus: 380 tokens. IDF weights are likewise derived from these 4 000 documents. A term like “authentication” that appears in 3 200 of them will have a low IDF and contribute little to ranking; a term like “SAML assertion” that appears in 40 documents will have a high IDF and strongly distinguish relevant results.
If that same engine were later pointed at a corpus of 400 000 legal contracts, all statistics would change — the same query would produce different rankings even with identical documents present, because IDF weights are corpus-relative.
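This corpus-relativity is easy to verify directly. The sketch below uses one common BM25 IDF formulation (the log form used by Lucene) with the document counts from the example above; the legal-corpus document frequency of 500 is a hypothetical value for illustration.

```python
import math

def bm25_idf(N: int, df: int) -> float:
    # N: documents in the corpus; df: documents containing the term.
    return math.log((N - df + 0.5) / (df + 0.5) + 1)

# 4 000-document documentation corpus (counts from the example above):
print(bm25_idf(4_000, 3_200))  # "authentication" -> ~0.22, weak ranking signal
print(bm25_idf(4_000, 40))     # "SAML assertion" -> ~4.59, strong ranking signal

# Against a hypothetical 400 000-document legal corpus where
# "authentication" is rare, its IDF rises sharply, so rankings change:
print(bm25_idf(400_000, 500))  # "authentication" -> ~6.7
```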
[illustrate: before/after showing two corpora side by side — the 4 000 doc corpus and a 400 000 doc corpus — with IDF values for the same three terms (“the”, “authentication”, “SAML”) displayed beneath each, illustrating how IDF shifts as corpus composition changes]
Variants and history
The word corpus entered computational linguistics from the Latin for “body”, and was used in classical philology long before computers. The modern sense emerged alongside corpus linguistics in the 1960s–80s, when researchers first assembled machine-readable text collections for statistical analysis of language.
Reference corpora. Widely used standard collections against which systems are benchmarked: the Brown Corpus (one million words of American English, 1961), the British National Corpus (100 million words), Common Crawl (petabytes of web text used to train most large language models), and MS MARCO (8.8 million passages for evaluating retrieval systems).
Domain-specific corpora. A general-purpose corpus is usually too noisy and too broad for specialised applications. Medical NLP systems are typically trained or fine-tuned on corpora like PubMed abstracts or MIMIC clinical notes; legal systems on court judgements and statutory text. Domain specificity almost always outperforms scale for narrow tasks.
Parallel corpora. Collections where the same content exists in two or more languages, aligned at the sentence or paragraph level. Machine translation systems are trained on parallel corpora; the quality and quantity of aligned pairs are the primary determinants of translation quality.
Annotated (gold-standard) corpora. A subset of the corpus is manually labelled — for named entities, sentiment, coreference, relevance judgements, or another target signal. These labelled subsets are used to train supervised models and to evaluate system performance. Creating gold-standard annotations is expensive, which is why datasets like SQuAD, CoNLL-2003, and TREC remain widely used years after publication.
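For concreteness, gold-standard annotations are usually just text plus aligned labels. CoNLL-2003, for instance, stores one token per line with its part-of-speech, chunk, and named-entity tags (the sentence below is the dataset’s own canonical example):

```
EU      NNP B-NP B-ORG
rejects VBZ B-VP O
German  JJ  B-NP B-MISC
call    NN  I-NP O
```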
When to use it
Every NLP system has a corpus; the question is whether you are being deliberate about it.
For search systems: audit the corpus regularly. Documents that are stale, duplicated, or structurally different from user queries distort IDF statistics and degrade ranking. Deduplication and quality filtering pay off disproportionately.
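A minimal sketch of near-duplicate detection via shingle overlap follows; production systems scale this with MinHash/LSH rather than pairwise comparison, and the threshold here is an assumption to tune per corpus.

```python
def shingles(text: str, k: int = 5) -> set[str]:
    # Overlapping k-token windows; near-duplicates share most shingles.
    tokens = text.lower().split()
    if len(tokens) <= k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0

def is_near_duplicate(doc_a: str, doc_b: str, threshold: float = 0.9) -> bool:
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold
```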
For model training: the corpus determines what the model can and cannot know. A model trained on general web text will underperform on specialist jargon until it is fine-tuned on in-domain data. When performance on a narrow domain falls short, reaching for more domain-specific corpus data is usually more effective than reaching for a larger model.
For evaluation: always hold out a portion of the corpus (or use a separate test corpus) that the system has never seen during development. Evaluating on the training corpus produces inflated metrics that do not reflect real-world performance.
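A held-out split can be as simple as a seeded shuffle over document identifiers; the fraction and seed below are illustrative defaults, not prescriptions.

```python
import random

def holdout_split(doc_ids, test_fraction: float = 0.1, seed: int = 42):
    # Deterministic split: shuffle once with a fixed seed, slice, and
    # never look at the held-out slice during development.
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * (1 - test_fraction))
    return ids[:cut], ids[cut:]   # (development set, held-out test set)
```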
The practical minimum for a useful corpus varies wildly by task: a few hundred labelled examples can fine-tune a modern language model for classification; training a competitive information retrieval system from scratch typically requires millions of query–document relevance pairs.