Wavelet Tree
What it is
A wavelet tree is a succinct data structure supporting efficient rank and select queries on sequences. It recursively partitions alphabet using a binary tree of bitmaps, enabling compression (near-optimal space) with fast queries O(log σ) where σ = alphabet size. Wavelet trees underpin FM-indexes and enable compressed full-text search.
[illustrate: Wavelet tree structure showing binary partitioning of alphabet; path from root to leaf showing character encoding]
How it works
-
Construction:
- Partition alphabet into two halves recursively
- At each level, create bitmap (1 if character in upper half)
- Recursively build subtrees
- Height: log σ where σ = alphabet size
-
Rank query rank_c(position):
- Count occurrences of character c up to position
- Navigate tree: follow path determined by c’s binary encoding
- O(log σ) time
-
Select query select_c(k):
- Find position of k-th occurrence of character c
- Use binary search + rank queries
- O(log σ × log n) time
Example
Sequence: "abracadabra" (11 characters)
Alphabet: {a, b, c, d, r} (5 symbols)
Wavelet tree (binary partitioning):
Level 0: {a, b, c} vs {d, r}
bitmap: 11010101010 (a,b,c are in lower half)
Level 1a ({a,b,c}): {a} vs {b,c}
bitmap: 10100010100
Level 1b ({d,r}): {d} vs {r}
bitmap: 10
rank_a(position 5) = count('a' in first 5 chars)
Navigate: 'a' → lower at level 0, left at level 1a
Follow path, count 1s in relevant bitmaps
Answer: 3
Space: ~|seq| × log σ bits (highly compressible)
Variants and history
Wavelet trees invented by Grossi et al. (2003) for compressed text indexing. Huffman-shaped wavelet trees reduce space further by using optimal-length codes. Multiary wavelet trees generalize binary structure. Foundation of FM-indexes for full-text search without decompression. Modern applications: compressed IR, genomic sequence indexing, graph compression.
When to use it
Use wavelet trees when:
- Need fast rank/select on large sequences
- Space efficiency critical
- Full-text search on compressed data
- Building FM-indexes
- Alphabet size moderate (≤ 256 typical)
Wavelet trees are sophisticated but well-implemented in libraries (Succinct, SDSL). Trade-off: faster than decompressing entire data, slower than uncompressed arrays.