Suffix Array
What it is
A suffix array is an array of integers representing lexicographically sorted suffixes of a string. It provides space-efficient full-text indexing: space O(n log n) bits (vs. suffix trees O(n) pointers). Pattern matching takes O(m log n) time (binary search over sorted suffixes). Standard in IR and bioinformatics for exact match queries.
[illustrate: Text with all suffixes listed and sorted; binary search for pattern; comparison to suffix tree space]
How it works
-
Construction:
- Extract all n suffixes of text
- Sort suffixes lexicographically
- Store array of starting positions
- O(n log^2 n) naive; O(n) with advanced algorithms
-
Pattern search (binary search):
- Look for pattern p in sorted suffix array
- Find range of suffixes starting with p
- Return positions from range
-
Time/Space:
- Search: O(m log n) where m = pattern length
- Space: O(n log n) bits (more compact than suffix tree)
Example
Text: "banana$"
Suffixes (with positions):
Position 0: "banana$"
Position 1: "anana$"
Position 2: "nana$"
Position 3: "ana$"
Position 4: "na$"
Position 5: "a$"
Position 6: "$"
Sorted:
"$" (6)
"a$" (5)
"ana$" (3)
"anana$" (1)
"banana$" (0)
"na$" (4)
"nana$" (2)
Suffix array: [6, 5, 3, 1, 0, 4, 2]
Search "na":
Binary search for "na" in sorted order
Range: positions 4-5 in SA
Matches: positions 4, 2 in original text
Variants and history
Suffix arrays introduced by Manber & Myers (1990) as space-efficient alternative to suffix trees. DC3 algorithm (Kärkkainen & Sanders, 2003) constructs in linear time O(n). Enhanced suffix arrays add additional structures (LCP array) for faster queries. Practical implementations (SDSL, Bowtie) handle large texts. Standard in bioinformatics and modern search engines.
When to use it
Use suffix arrays when:
- Full-text pattern matching on static text
- Space efficiency critical (DNA, large texts)
- Construction time less important than query speed
- Index fits in memory (billions of chars)
- Exact match queries dominate
Suffix arrays balance space (compact) and time (logarithmic search). FM-indexes improve space further; suffix trees improve time.