fastText
What it is
fastText is a library and method for learning word and sentence embeddings, developed by Facebook AI Research (Bojanowski et al., 2017). Unlike Word2Vec, which treats words as atomic units, fastText represents each word as a bag of character n-grams. This enables strong performance on morphologically rich languages and graceful handling of out-of-vocabulary (OOV) terms.
[illustrate: Word “running” decomposed into character trigrams, embeddings aggregated, showing how unseen “runningly” inherits meaning from n-gram overlap]
How it works
- Character n-grams: Break each word into overlapping subword units (n-grams of 3 to 6 characters by default; trigrams shown here). With boundary markers < and >, “running” yields: <ru, run, unn, nni, nin, ing, ng>
- Embedding lookup: Each n-gram has its own embedding vector, and the full word <running> gets one as well
- Aggregation: Word embedding = sum or mean of its constituent n-gram vectors
- Training: Use skip-gram or CBOW objectives on n-gram vectors
- Inference: Novel (out-of-vocabulary) words are represented as the sum of their n-gram vectors, as sketched below
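The following is a minimal NumPy sketch of these steps. The n-gram extraction matches fastText's boundary-marked subwords; the rest is an assumption for illustration: the embedding table is random rather than trained, and the hash-bucket lookup only loosely mirrors the library's internal hashing scheme.

```python
import numpy as np

def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list[str]:
    """Boundary-marked character n-grams, fastText style."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(marked) - n + 1)]

print(char_ngrams("running", min_n=3, max_n=3))
# ['<ru', 'run', 'unn', 'nni', 'nin', 'ing', 'ng>']

# Toy n-gram table: hash each n-gram into a fixed number of buckets,
# each holding a vector. Random here; trained parameters in reality.
rng = np.random.default_rng(0)
NUM_BUCKETS, DIM = 100_000, 50
table = rng.normal(size=(NUM_BUCKETS, DIM)).astype(np.float32)

def word_vector(word: str) -> np.ndarray:
    """Word embedding = sum of its constituent n-gram vectors."""
    rows = [hash(g) % NUM_BUCKETS for g in char_ngrams(word)]
    return table[rows].sum(axis=0)

# Any string gets a vector, seen in training or not:
print(word_vector("runningly").shape)  # (50,)
```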
Example
# "running" = sum of embeddings for <ru, run, unn, nni, nin, ing, ng>
# Misspelled "runnig" shares n-grams with "running"
# Unseen word "sprinting" inherits meaning from "-ing" suffix
similarity("running", "sprint") = high due to shared n-grams
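This behavior can be reproduced with Gensim's FastText implementation. A short example follows, assuming Gensim 4.x; the toy corpus is made up and far too small for meaningful vectors:

```python
from gensim.models import FastText

# Tiny made-up corpus; real training needs far more text.
sentences = [
    ["the", "athlete", "is", "running", "fast"],
    ["she", "was", "sprinting", "down", "the", "track"],
    ["he", "likes", "running", "and", "jumping"],
]

model = FastText(sentences, vector_size=50, window=3,
                 min_count=1, min_n=3, max_n=6, epochs=50)

# "runnig" was never seen, yet it still gets a vector via its n-grams
print("runnig" in model.wv.key_to_index)         # False: it is OOV
print(model.wv.similarity("running", "runnig"))  # high: shared n-grams
```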
fastText also supports supervised text classification, using the same averaged bag-of-n-grams representation as input to a linear classifier; a sketch follows.
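With the official fasttext Python bindings, supervised training looks like this; the file name, labels, and hyperparameter values are hypothetical placeholders:

```python
import fasttext

# train.txt (hypothetical) holds one example per line, with labels
# prefixed by "__label__", e.g.:
#   __label__positive great movie , loved it
#   __label__negative total waste of time
model = fasttext.train_supervised("train.txt", epoch=25, wordNgrams=2)

# predict() returns the top label(s) and their probabilities
labels, probs = model.predict("what a fantastic film")
print(labels, probs)  # e.g. (('__label__positive',), array([0.97]))
```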
Variants and history
fastText was open-sourced by Facebook AI Research in 2016, with its reference papers published in 2016 and 2017, as a practical improvement on Word2Vec, particularly for under-resourced languages. Its n-gram approach was inspired by earlier work on morphology but applied at scale. The library became widely adopted for its efficiency, multilingual support (including pretrained vectors for over 150 languages), and robustness to spelling variation. Modern subword tokenizers (BPE, SentencePiece, WordPiece) in contextual models follow similar principles.
When to use it
Choose fastText when:
- Working with morphologically rich or agglutinative languages
- Handling noisy or misspelled text
- OOV words are common
- You need fast, lightweight embeddings
- Multilingual coverage is important
fastText is efficient and robust, but its embeddings are static: every occurrence of a word receives the same vector, so it cannot disambiguate word senses in context. For tasks requiring that kind of syntactic or semantic nuance, contextual models such as BERT are the better fit.