Aho-Corasick
What it is
Aho-Corasick is a string-matching algorithm that searches text for multiple patterns simultaneously in linear time O(n) + construction time. It’s far more efficient than running single-pattern matchers sequentially. Applications: synonym expansion, keyword highlighting, entity tagging, and content filtering.
[illustrate: Trie of patterns with failure links; text scanning with pattern matches highlighted; comparison to sequential single-pattern matching]
How it works
-
Construction (offline):
- Build trie from all patterns
- Add failure links (point to longest suffix match)
- O(m) preprocessing, where m = total pattern length
-
Scanning (online):
- Read text character-by-character
- Follow trie transitions; use failure links on mismatch
- O(n) for n text characters
- Report matches when reaching pattern terminals
-
Key property: Linear time with respect to text length + pattern count, independent of pattern complexity
Example
Patterns: ["cat", "dog", "at"]
Text: "The cat and dog ate the cat food"
Trie structure:
c → a → t (pattern "cat", terminal)
d → o → g (pattern "dog", terminal)
a → t (pattern "at", terminal)
Scanning text:
"The cat and dog ate..."
Matches:
Position 4: "cat" ✓
Position 11: "dog" ✓
Position 16: "at" (in "ate") ✓
Position 27: "cat" ✓
Time: O(|text|) = O(30) + construction
Variants and history
Aho-Corasick algorithm published in 1975, foundational for pattern matching. Multiple pattern matching with low overhead enabled applications like:
- Virus scanning (many signatures)
- Intrusion detection (pattern matching)
- Text processing (synonym expansion)
Variants: Wu-Manber (optimized for long patterns), regex-based extensions. Modern libraries (grep, ripgrep) often use Aho-Corasick internally.
When to use it
Use Aho-Corasick when:
- Matching many patterns against large text (efficiency)
- Synonym or keyword highlighting
- Entity or mention detection (known list of entities)
- Content filtering (banned words, toxicity screening)
- Any multi-pattern string matching
Aho-Corasick is asymptotically optimal for this problem. Practical constant factors matter; implementation details (failure link construction, memory layout) affect performance.