FM-Index
What it is
The FM-index (Full-text Minute-space index) is a full-text index enabling pattern matching on compressed text without decompression. Built on the Burrows-Wheeler Transform (BWT) combined with wavelet trees, FM-indexes achieve space comparable to good compressors while supporting sub-linear pattern search. Fundamental to modern bioinformatics (genomic indexing).
[illustrate: BWT transformation → wavelet tree → pattern search showing backward search algorithm]
How it works
-
Burrows-Wheeler Transform:
- Rotate string all ways, sort lexicographically
- Keep last column (L) and row index (I)
- Highly compressible (characters cluster)
-
Index construction:
- Compute BWT (L column)
- Build FM-index: L with rank/select support (via wavelet tree)
- Track first-to-last mapping
-
Pattern search (backward search):
- Start at pattern’s last character
- For each character (right to left), update range in L
- Use FM-index rank queries to narrow range
- O(m × log n) where m = pattern length
Example
Text: "mississippi"
BWT: "ipssm$pissii" (L column, I = 10)
Query: Find "issi"
Backward search:
'i': range in L
's': update range via FM-index rank
's': further narrow
'i': final range
Result: [4, 4] → one match at position 4
Space: 0.5–2 bits per character (vs. 8 bits uncompressed)
Time: O(m × log n) = O(4 × log 11) ≈ O(16 operations)
Variants and history
FM-index introduced by Ferragina & Manzini (2000). Bidirectional FM-indexes enable searching left and right. Sampled FM-indexes reduce space further. Practical implementations (Bowtie, BWA) index genomes (billions of bases) efficiently. RFM-index and CFM-index improve for repetitive texts. Modern bioinformatics standard for sequence indexing (DNA, RNA).
When to use it
Use FM-indexes when:
- Full-text search on large texts (genomes, archives)
- Space and time both critical
- Pattern matching dominates (insertion rare)
- Indexed data is static
- Exact match queries sufficient
FM-indexes are sophisticated but well-implemented (Bowtie, Samtools, SDSL). Trade-off: fast + small + queryable without decompression, but insertion/deletion complex.