FM-Index

What it is

The FM-index (Full-text Minute-space index) is a full-text index enabling pattern matching on compressed text without decompression. Built on the Burrows-Wheeler Transform (BWT) combined with wavelet trees, FM-indexes achieve space comparable to good compressors while supporting sub-linear pattern search. Fundamental to modern bioinformatics (genomic indexing).

[illustrate: BWT transformation → wavelet tree → pattern search showing backward search algorithm]

How it works

  1. Burrows-Wheeler Transform:

    • Rotate string all ways, sort lexicographically
    • Keep last column (L) and row index (I)
    • Highly compressible (characters cluster)
  2. Index construction:

    • Compute BWT (L column)
    • Build FM-index: L with rank/select support (via wavelet tree)
    • Track first-to-last mapping
  3. Pattern search (backward search):

    • Start at pattern’s last character
    • For each character (right to left), update range in L
    • Use FM-index rank queries to narrow range
    • O(m × log n) where m = pattern length

Example

Text: "mississippi"
BWT: "ipssm$pissii"  (L column, I = 10)

Query: Find "issi"
Backward search:
  'i': range in L
  's': update range via FM-index rank
  's': further narrow
  'i': final range

Result: [4, 4] → one match at position 4

Space: 0.5–2 bits per character (vs. 8 bits uncompressed)
Time: O(m × log n) = O(4 × log 11) ≈ O(16 operations)

Variants and history

FM-index introduced by Ferragina & Manzini (2000). Bidirectional FM-indexes enable searching left and right. Sampled FM-indexes reduce space further. Practical implementations (Bowtie, BWA) index genomes (billions of bases) efficiently. RFM-index and CFM-index improve for repetitive texts. Modern bioinformatics standard for sequence indexing (DNA, RNA).

When to use it

Use FM-indexes when:

  • Full-text search on large texts (genomes, archives)
  • Space and time both critical
  • Pattern matching dominates (insertion rare)
  • Indexed data is static
  • Exact match queries sufficient

FM-indexes are sophisticated but well-implemented (Bowtie, Samtools, SDSL). Trade-off: fast + small + queryable without decompression, but insertion/deletion complex.

See also