Finite State Transducer

What it is

A Finite State Transducer (FST) is a data structure mapping input strings to output strings via a set of states and transitions. FSTs generalize finite state automata by associating outputs with transitions, enabling efficient compression, morphological analysis, and term-to-ID mapping. Lucene and other search engines use FSTs for compact term dictionaries.

[illustrate: FST state diagram showing transitions labeled with input/output; example mapping words to IDs]

How it works

  1. Structure:

    • States (nodes) connected by labeled transitions
    • Each transition: input symbol(s) → output symbol(s)
    • Single final state or multiple accepting states
  2. Construction:

    • Build from sorted list of (input, output) pairs
    • Merge common suffixes (minimize states)
    • Compress via shared transitions
  3. Lookup:

    • Follow transitions matching input
    • Accumulate output along path
    • O(m) time where m = input length
  4. Applications:

    • Term dictionary (word → ID)
    • Morphological analysis (word → lemma + tags)
    • Synonym matching (word → synonyms)

Example

Term-to-ID FST:
Input "cat" → output "123"
Input "cats" → output "124"
Input "dog" → output "234"
Input "dogs" → output "235"

Shared suffix structure:
  c-a-t: 123
  c-a-t-s: 124
  d-o-g: 234
  d-o-g-s: 235

FST compresses by merging common substructure

Memory: ~8 bytes per entry (vs. hash map ~100+ bytes)
Lookup: O(m) where m = key length

Variants and history

FSTs formalize automata theory (Rabin & Scott, 1959). Applied to IR in Lucene (2004+) for compact dictionaries. DAWG (Directed Acyclic Word Graph) is special case for dictionary compression. OpenFST library popularized FSTs in NLP (phonetic conversion, morphological analysis). Trie-based variants and minimized FSTs provide different space-time tradeoffs.

When to use it

Use FSTs when:

  • Memory efficiency is critical (millions of terms)
  • Fast dictionary lookup needed
  • Morphological analysis on-the-fly
  • Synonym or variant expansion
  • Compact term dictionaries for mobile/embedded systems

FSTs excel for string-to-string mapping. For complex operations (scoring, ranking), combine with other structures.

See also