Burrows-Wheeler Transform

What it is

The Burrows-Wheeler Transform (BWT) is a reversible text permutation that rearranges characters to group similar contexts together, making text highly compressible. BWT is the foundation of bzip2 compression and FM-indexes for pattern matching. The transform is bijective (one-to-one), enabling lossless decompression.

[illustrate: Text rotations, lexicographic sorting, last column extraction; comparison showing locality of similar characters in BWT]

How it works

  1. Forward transform:

    • Create all rotations of input text
    • Sort rotations lexicographically
    • Extract last column (L)
    • Save index of original string (I)
  2. Properties:

    • Characters with similar left context cluster in L
    • L is highly compressible (many repeated characters)
    • Reversible: can reconstruct original from L and I
  3. Inverse transform:

    • Build first column (F) from L (sort L to get F)
    • Use L-to-F mapping (LF) to trace original
    • Reconstruct by following indices from I

Example

Text: "banana"
All rotations (sorted):
  banana
  ananab
  nanaba
  anaban
  nabana
  banana (original at row 3, I=3)

Sorted:
  anaban
  ananab
  banana
  banana
  nanaba
  nabana

L column (last chars): "annbaa"
I = 3 (original string rank)

BWT("banana") = ("annbaa", 3)

Inverse:
L = "annbaa"
F = "aaabnn" (sorted L)
LF mapping: a→a, n→n, n→b, b→a, a→n, a→n
Reconstruct: follow indices from I=3

Variants and history

BWT invented by Burrows & Wheeler (1994). bzip2 compression uses BWT + move-to-front + Huffman. FM-index (2000) built on BWT for pattern matching. Lightweight FM-index variants balance space and time. Applications: text compression (bzip2), genomic indexing, full-text search. Theoretical interest in context modeling and compression bounds.

When to use it

Use BWT for:

  • Text compression (bzip2)
  • Building FM-indexes for full-text search
  • Genomic sequence indexing
  • Pattern matching on compressed text
  • Data transformation for context-based compression

BWT transformation overhead is O(n log n) or O(n) depending on algorithm. Practical for offline preprocessing; enables fast queries.

See also