Stop List
What it is
A stop list is a manually curated set of high-frequency function words that carry little semantic content: articles (the, a), conjunctions (and, or), prepositions (in, on), pronouns (I, you). Stop words are filtered out during preprocessing to reduce index size, noise, and computational cost. Stop lists are language- and domain-specific.
[illustrate: Text before and after stop-word filtering; size reduction and focus on content words]
How it works
Stop word definition:
- Function words with high frequency and low discriminative power
- Typical: articles, prepositions, conjunctions, pronouns, auxiliary verbs
- Language-specific: English lists typically contain ~100–500 words
Filtering:
- During preprocessing, drop every token that appears in the stop list (see the sketch below)
- Example: “The quick brown fox” → [“quick”, “brown”, “fox”]
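A minimal sketch of this step in Python; the whitespace tokenizer and the small stop list here are illustrative, not a standard:

```python
# Minimal stop-word filtering sketch; the stop list is a tiny
# illustrative subset, not a standard list.
STOP_WORDS = {"the", "a", "an", "and", "or", "but", "in", "on", "at",
              "for", "with", "from", "to", "is", "are", "be"}

def filter_stop_words(text: str) -> list[str]:
    """Lowercase, split on whitespace, and drop stop-listed tokens."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]

print(filter_stop_words("The quick brown fox"))  # ['quick', 'brown', 'fox']
```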
Trade-offs:
- Pro: smaller index (typically a 30–50% reduction in postings), less noise
- Con: loses grammatical information and destroys phrases made entirely of stop words (“The Who”, “to be or not to be” → []; see the demo below)
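The phrase-breaking failure is easy to demonstrate: with a stop list containing “the”, “who”, “to”, “be”, “or”, and “not” (all common stop-list entries), both a band name and a famous quote filter to nothing:

```python
# Phrase-breaking pitfall: phrases made entirely of stop words vanish.
STOP_WORDS = {"the", "who", "to", "be", "or", "not"}

def filter_stop_words(text: str) -> list[str]:
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]

print(filter_stop_words("The Who"))             # [] -- band name lost
print(filter_stop_words("to be or not to be"))  # [] -- quote lost
```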
Example
Stop list (English):
{the, a, an, and, or, but, in, on, at, for, with, from, to, is, are, be, ...}
Original text:
"The quick brown fox jumped over the lazy dog in the meadow"
After stop-word filtering:
["quick", "brown", "fox", "jumped", "lazy", "dog", "meadow"]
Token reduction: 4 of 11 tokens removed (~36%)
Content preserved: mainly content words (nouns, verbs, adjectives)
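To see where the index-size savings come from, the toy sketch below builds an inverted index with and without a stop list and counts postings, i.e. (term, document) pairs. The two-document corpus and the stop list are illustrative:

```python
from collections import defaultdict

# Illustrative stop list and two-document toy corpus.
STOP_WORDS = {"the", "a", "an", "and", "or", "but", "in", "on", "at",
              "for", "with", "from", "to", "is", "are", "be", "over"}

docs = {
    1: "The quick brown fox jumped over the lazy dog in the meadow",
    2: "A dog and a fox are in the meadow",
}

def build_index(docs, stop_words=frozenset()):
    """Map each kept token to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in text.lower().split():
            if tok not in stop_words:
                index[tok].add(doc_id)
    return index

def num_postings(index):
    """Total (term, document) pairs stored in the index."""
    return sum(len(doc_ids) for doc_ids in index.values())

full = build_index(docs)
filtered = build_index(docs, STOP_WORDS)
print(num_postings(full), num_postings(filtered))  # 18 10 (~44% smaller)
```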
Variants and history
Stop-word filtering is a classical IR technique (1950s–60s). Early systems used simple lists, often just the top 100 most frequent words; language-specific lists were later developed for major languages. Modern systems often skip stop-word filtering entirely (BERT and dense retrieval handle stop words implicitly), but it remains useful for efficiency. Domain-specific stop lists remove words that are frequent but uninformative within a domain (e.g., “paper”, “abstract” in academic text).
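One common recipe for a domain-specific list, echoing the early top-100 approach, is to take the most frequent words of the domain corpus. A sketch under toy assumptions (tiny corpus, arbitrary cutoff k):

```python
from collections import Counter

# Toy "academic" corpus; in practice this would be the full domain corpus.
corpus = [
    "this paper presents a new method",
    "the abstract of this paper summarizes results",
    "this paper extends prior work described in the abstract",
]

counts = Counter(tok for doc in corpus for tok in doc.lower().split())

# Keep the top-k most frequent words as the domain stop list;
# k is a tuning choice (early systems used roughly the top 100).
k = 4
domain_stop_list = {word for word, _ in counts.most_common(k)}
print(domain_stop_list)  # {'this', 'paper', 'the', 'abstract'}
```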
When to use it
Use stop lists when:
- Building inverted-index IR systems (efficiency)
- Reducing preprocessing noise for traditional NLP
- Corpus analysis (focus on content words)
- Resource-constrained scenarios (reduce index size)
- Language learning (focus on content)
Modern neural models (BERT, transformers) don’t require stop-word filtering; they learn to down-weight function words implicitly. Filtering reduces flexibility but improves efficiency in sparse systems, as the example below shows.
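In sparse pipelines this is usually a one-line switch. For example, scikit-learn’s CountVectorizer accepts stop_words='english' to apply its built-in English stop list; one possible setup, assuming scikit-learn is installed:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The quick brown fox jumped over the lazy dog in the meadow"]

# Without a stop list, every word lands in the vocabulary.
unfiltered = CountVectorizer().fit(docs)
# stop_words='english' applies scikit-learn's built-in English stop list.
filtered = CountVectorizer(stop_words="english").fit(docs)

print(sorted(unfiltered.vocabulary_))  # includes 'the', 'in', 'over'
print(sorted(filtered.vocabulary_))    # content words only
```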