Finite State Transducer
What it is
A Finite State Transducer (FST) is a data structure mapping input strings to output strings via a set of states and transitions. FSTs generalize finite state automata by associating outputs with transitions, enabling efficient compression, morphological analysis, and term-to-ID mapping. Lucene and other search engines use FSTs for compact term dictionaries.
[illustrate: FST state diagram showing transitions labeled with input/output; example mapping words to IDs]
How it works
-
Structure:
- States (nodes) connected by labeled transitions
- Each transition: input symbol(s) → output symbol(s)
- Single final state or multiple accepting states
-
Construction:
- Build from sorted list of (input, output) pairs
- Merge common suffixes (minimize states)
- Compress via shared transitions
-
Lookup:
- Follow transitions matching input
- Accumulate output along path
- O(m) time where m = input length
-
Applications:
- Term dictionary (word → ID)
- Morphological analysis (word → lemma + tags)
- Synonym matching (word → synonyms)
Example
Term-to-ID FST:
Input "cat" → output "123"
Input "cats" → output "124"
Input "dog" → output "234"
Input "dogs" → output "235"
Shared suffix structure:
c-a-t: 123
c-a-t-s: 124
d-o-g: 234
d-o-g-s: 235
FST compresses by merging common substructure
Memory: ~8 bytes per entry (vs. hash map ~100+ bytes)
Lookup: O(m) where m = key length
Variants and history
FSTs formalize automata theory (Rabin & Scott, 1959). Applied to IR in Lucene (2004+) for compact dictionaries. DAWG (Directed Acyclic Word Graph) is special case for dictionary compression. OpenFST library popularized FSTs in NLP (phonetic conversion, morphological analysis). Trie-based variants and minimized FSTs provide different space-time tradeoffs.
When to use it
Use FSTs when:
- Memory efficiency is critical (millions of terms)
- Fast dictionary lookup needed
- Morphological analysis on-the-fly
- Synonym or variant expansion
- Compact term dictionaries for mobile/embedded systems
FSTs excel for string-to-string mapping. For complex operations (scoring, ranking), combine with other structures.