Trie

Data-Structures Indexing Information-Retrieval Query-Parsing Fuzzy-Matching Needs-Review

What it is

A trie (pronounced “try”, from retrieval) is a tree-shaped data structure in which each node represents a single character and every path from the root down to a marked terminal node spells out a complete term. Because the structure is keyed on characters rather than hashed values, navigating to any term of length k costs exactly k node traversals — O(k) — regardless of how many terms the trie contains.

That property makes tries the natural choice wherever the access pattern is prefix-oriented: autocomplete, dictionary lookup, IP routing tables, and the term dictionaries inside full-text search engines all exploit it. An inverted index tells you which documents contain a term; a trie tells you which terms in the dictionary share a given prefix.

How it works

Each node in a trie holds:

a character (or, in compressed variants, a character sequence)
pointers to child nodes, one per possible next character
a terminal flag indicating whether the path to this node constitutes a complete, valid term

Inserting the term "cat" walks the trie from the root following c → a → t, creating nodes as needed, and sets the terminal flag on t. Inserting "car" reuses the c → a prefix and branches at r.

[illustrate: step-by-step trie construction — insert “cat”, then “car”, then “card”, then “care”, then “dog” — show the tree after each insertion with terminal nodes marked, shared prefixes highlighted in a distinct colour]

Lookup follows the same path as insertion: traverse the characters of the query string one by one. If any character has no corresponding child, the term is absent. If all characters are consumed and the terminal flag is set, the term exists. Cost: O(k).

Prefix enumeration — the operation that powers autocomplete — works by navigating to the node at the end of the prefix, then performing a depth-first traversal of every subtree beneath it, collecting all terminal nodes reached. Every collected path is a term that begins with the query prefix.

[illustrate: prefix enumeration on a trie containing “cat”, “car”, “card”, “care”, “dog” — query prefix “ca” navigated to node ‘a’, then DFS collects “cat”, “car”, “card”, “care” with the subtree shaded]

In practice, child pointers are stored as a hash map (fast lookup, sparse children) or a fixed-size array indexed by character code (fast lookup, dense alphabet — common for lowercase ASCII).

Example

Terms indexed: "cat", "car", "card", "care", "dog"

(root)
├── c
│   └── a
│       ├── t*          ← "cat"
│       └── r*          ← "car"
│           ├── d*      ← "card"
│           └── e*      ← "care"
└── d
    └── o
        └── g*          ← "dog"

* marks terminal nodes.

Query: prefix "car"

Traverse root → c → a → r (terminal: “car” exists).
DFS from r: visit d (terminal: “card”), visit e (terminal: “care”).
Results: ["car", "card", "care"] — three terms returned in O(k + m), where m is the number of results.

Query: exact lookup "cap"

Traverse root → c → a → no child p.
Return: not found. Cost: 3 node traversals.

Variants and history

The trie was described by René de la Briandais in 1959 and named by Edward Fredkin in 1960 — Fredkin derived the name from “retrieval”, though the pronunciation drifted to “try” to avoid confusion with “tree”.

Compressed trie (Patricia trie / radix tree). Chains of single-child nodes are collapsed into a single edge labelled with the full substring. "dog" stored alone becomes a single edge dog from root rather than three nodes d → o → g. This dramatically reduces node count for sparse tries and is the basis of the radix tree used in network routing and the Patricia trie used in many production search engines.

Double-array trie. An array-based encoding that stores the entire trie in two parallel integer arrays (base and check). Lookup is a sequence of array reads rather than pointer dereferences — cache-friendly and compact. Widely used in Japanese NLP toolkits such as MeCab and in some Lucene finite-state automaton variants.

Finite-state transducer (FST). A compressed trie that also shares suffixes (not just prefixes). Where a trie shares only left context, an FST shares both left and right context, yielding much smaller structures for large vocabularies. Lucene uses an FST for its term dictionary: the entire vocabulary is encoded in a single compact byte array walked at query time.

Ternary search trie. A memory-efficient variant where each node has at most three children — less, equal, greater — rather than one per alphabet symbol. Useful when alphabet size is large or unpredictable.

When to use it

A trie is the right choice when:

Autocomplete or prefix search is a first-class requirement. Prefix enumeration is O(k + m) in a trie; achieving the same with a sorted array requires binary search plus a forward scan, and is impossible with a plain hash map.
The vocabulary is known at index time. Tries are efficient when the term set is stable or append-only. Bulk-loading is fast; in-place deletion is more complex than in a hash map.
You need lexicographic ordering for free. A depth-first traversal of a trie yields all terms in sorted order — useful for range queries and facet enumeration.

It is less suited to:

Random single-term lookup where prefixes are irrelevant. A hash map offers O(1) average-case lookup with simpler implementation. Unless prefix queries matter, a trie’s overhead is unwarranted.
Very long keys or large alphabets. Each node fans out by one character; Unicode text with thousands of possible code points makes fixed-array child storage wasteful. Compressed variants (radix trees, FSTs) are needed.
Approximate (fuzzy) matching. A trie enables exact and prefix matching efficiently, but finding terms within edit distance d of a query requires either augmenting the trie with Levenshtein automata or falling back to a character n-gram index for candidate retrieval.

In a full-text search engine, a trie or FST typically sits in front of the inverted index: the trie resolves a prefix query to a set of matching terms, and the inverted index is then consulted for each term to retrieve and score the relevant documents.