Aho-Corasick

What it is

Aho-Corasick is a string-matching algorithm that searches text for multiple patterns simultaneously in linear time O(n) + construction time. It’s far more efficient than running single-pattern matchers sequentially. Applications: synonym expansion, keyword highlighting, entity tagging, and content filtering.

[illustrate: Trie of patterns with failure links; text scanning with pattern matches highlighted; comparison to sequential single-pattern matching]

How it works

  1. Construction (offline):

    • Build trie from all patterns
    • Add failure links (point to longest suffix match)
    • O(m) preprocessing, where m = total pattern length
  2. Scanning (online):

    • Read text character-by-character
    • Follow trie transitions; use failure links on mismatch
    • O(n) for n text characters
    • Report matches when reaching pattern terminals
  3. Key property: Linear time with respect to text length + pattern count, independent of pattern complexity

Example

Patterns: ["cat", "dog", "at"]
Text: "The cat and dog ate the cat food"

Trie structure:
  c → a → t (pattern "cat", terminal)
  d → o → g (pattern "dog", terminal)
  a → t (pattern "at", terminal)

Scanning text:
"The cat and dog ate..."
Matches:
  Position 4: "cat" ✓
  Position 11: "dog" ✓
  Position 16: "at" (in "ate") ✓
  Position 27: "cat" ✓

Time: O(|text|) = O(30) + construction

Variants and history

Aho-Corasick algorithm published in 1975, foundational for pattern matching. Multiple pattern matching with low overhead enabled applications like:

  • Virus scanning (many signatures)
  • Intrusion detection (pattern matching)
  • Text processing (synonym expansion)

Variants: Wu-Manber (optimized for long patterns), regex-based extensions. Modern libraries (grep, ripgrep) often use Aho-Corasick internally.

When to use it

Use Aho-Corasick when:

  • Matching many patterns against large text (efficiency)
  • Synonym or keyword highlighting
  • Entity or mention detection (known list of entities)
  • Content filtering (banned words, toxicity screening)
  • Any multi-pattern string matching

Aho-Corasick is asymptotically optimal for this problem. Practical constant factors matter; implementation details (failure link construction, memory layout) affect performance.

See also