Burrows-Wheeler Transform
What it is
The Burrows-Wheeler Transform (BWT) is a reversible text permutation that rearranges characters to group similar contexts together, making text highly compressible. BWT is the foundation of bzip2 compression and FM-indexes for pattern matching. The transform is bijective (one-to-one), enabling lossless decompression.
[illustrate: Text rotations, lexicographic sorting, last column extraction; comparison showing locality of similar characters in BWT]
How it works
-
Forward transform:
- Create all rotations of input text
- Sort rotations lexicographically
- Extract last column (L)
- Save index of original string (I)
-
Properties:
- Characters with similar left context cluster in L
- L is highly compressible (many repeated characters)
- Reversible: can reconstruct original from L and I
-
Inverse transform:
- Build first column (F) from L (sort L to get F)
- Use L-to-F mapping (LF) to trace original
- Reconstruct by following indices from I
Example
Text: "banana"
All rotations (sorted):
banana
ananab
nanaba
anaban
nabana
banana (original at row 3, I=3)
Sorted:
anaban
ananab
banana
banana
nanaba
nabana
L column (last chars): "annbaa"
I = 3 (original string rank)
BWT("banana") = ("annbaa", 3)
Inverse:
L = "annbaa"
F = "aaabnn" (sorted L)
LF mapping: a→a, n→n, n→b, b→a, a→n, a→n
Reconstruct: follow indices from I=3
Variants and history
BWT invented by Burrows & Wheeler (1994). bzip2 compression uses BWT + move-to-front + Huffman. FM-index (2000) built on BWT for pattern matching. Lightweight FM-index variants balance space and time. Applications: text compression (bzip2), genomic indexing, full-text search. Theoretical interest in context modeling and compression bounds.
When to use it
Use BWT for:
- Text compression (bzip2)
- Building FM-indexes for full-text search
- Genomic sequence indexing
- Pattern matching on compressed text
- Data transformation for context-based compression
BWT transformation overhead is O(n log n) or O(n) depending on algorithm. Practical for offline preprocessing; enables fast queries.