Burrows-Wheeler Transform

Burrows-Wheeler-Transform Bwt Transform Compression Reversible Needs-Review

What it is

The Burrows-Wheeler Transform (BWT) is a reversible text permutation that rearranges characters to group similar contexts together, making text highly compressible. BWT is the foundation of bzip2 compression and FM-indexes for pattern matching. The transform is bijective (one-to-one), enabling lossless decompression.

[illustrate: Text rotations, lexicographic sorting, last column extraction; comparison showing locality of similar characters in BWT]

How it works

Forward transform:
- Create all rotations of input text
- Sort rotations lexicographically
- Extract last column (L)
- Save index of original string (I)
Properties:
- Characters with similar left context cluster in L
- L is highly compressible (many repeated characters)
- Reversible: can reconstruct original from L and I
Inverse transform:
- Build first column (F) from L (sort L to get F)
- Use L-to-F mapping (LF) to trace original
- Reconstruct by following indices from I

Example

Text: "banana"
All rotations (sorted):
  banana
  ananab
  nanaba
  anaban
  nabana
  banana (original at row 3, I=3)

Sorted:
  anaban
  ananab
  banana
  banana
  nanaba
  nabana

L column (last chars): "annbaa"
I = 3 (original string rank)

BWT("banana") = ("annbaa", 3)

Inverse:
L = "annbaa"
F = "aaabnn" (sorted L)
LF mapping: a→a, n→n, n→b, b→a, a→n, a→n
Reconstruct: follow indices from I=3

Variants and history

BWT invented by Burrows & Wheeler (1994). bzip2 compression uses BWT + move-to-front + Huffman. FM-index (2000) built on BWT for pattern matching. Lightweight FM-index variants balance space and time. Applications: text compression (bzip2), genomic indexing, full-text search. Theoretical interest in context modeling and compression bounds.

When to use it

Use BWT for:

Text compression (bzip2)
Building FM-indexes for full-text search
Genomic sequence indexing
Pattern matching on compressed text
Data transformation for context-based compression

BWT transformation overhead is O(n log n) or O(n) depending on algorithm. Practical for offline preprocessing; enables fast queries.