Suffix Tree

What it is

A suffix tree is a compressed trie of all suffixes of a string, supporting O(m) pattern matching for pattern of length m. Unlike suffix arrays (binary search O(m log n)), suffix trees directly navigate to pattern matches. Trade-off: space-expensive (O(n) pointers, ~20n bytes) vs. time-optimal (O(m) search). Used for theoretical algorithms and specialized string-matching problems.

[illustrate: Suffix tree structure showing compressed edges; pattern matching path from root to leaf; comparison to suffix array]

How it works

  1. Construction:

    • Insert all suffixes into trie
    • Compress chains of single-child nodes
    • Result: tree with O(n) nodes
    • Build time: O(n) with Ukkonen’s algorithm
  2. Pattern search:

    • Follow edges from root matching pattern
    • O(m) time (no binary search needed)
    • Find all occurrences in O(m + matches)
  3. Space: O(n) pointers; practically ~20 bytes per character

Example

Text: "banana$"
Suffix tree (simplified):
        root
       /    \
      $     a—————nana
      |    / \      |
      6   b   n—ana 3
          |   | \
          0   4  2
              |
              $
              5

Search "ana":
  From root, follow 'a' → 'n' → 'a'
  Found at node, return positions {3, 1}
  Time: O(3) = O(m)

Variants and history

Suffix trees introduced by Weiner (1973). Ukkonen’s algorithm (1995) constructs in linear time O(n), making practical. Generalized suffix trees handle multiple strings. Suffix tree construction is theoretically elegant but practically complex; suffix arrays often preferred for simplicity. Applications: longest common substring, repeats, periodicity analysis.

When to use it

Use suffix trees when:

  • Pattern matching speed is critical (O(m) vs. O(m log n))
  • Space is less constrained
  • Complex string algorithms (longest repeats, periods)
  • Multiple queries benefit from one index
  • Theoretical algorithms (not practical IR)

Suffix trees are time-optimal but space-expensive. Suffix arrays and FM-indexes usually preferred in practice due to space efficiency and similar query performance.

See also