fastText
What it is
fastText is a library and method for learning word and sentence embeddings, developed by Facebook AI Research (Bojanowski et al., 2017). Unlike Word2Vec, which treats words as atomic units, fastText represents each word as a bag of character n-grams. This enables strong performance on morphologically rich languages and graceful handling of out-of-vocabulary (OOV) terms.
[illustrate: Word “running” decomposed into character trigrams, embeddings aggregated, showing how unseen “runningly” inherits meaning from n-gram overlap]
How it works
- Character n-grams: Break each word into overlapping subword units (n-grams of 3 to 6 characters by default; trigrams shown here). With boundary markers < and >, “running” yields: <ru, run, unn, nni, nin, ing, ng>
- Embedding lookup: Each n-gram has its own embedding vector, and the full word <running> gets one as well
- Aggregation: Word embedding = sum or mean of its constituent n-gram vectors
- Training: Use skip-gram or CBOW objectives on n-gram vectors
- Inference: Novel (out-of-vocabulary) words are represented as the sum of their n-gram vectors, as sketched below
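The following is a minimal NumPy sketch of these steps. The n-gram extraction matches fastText's boundary-marked subwords; the rest is an assumption for illustration: the embedding table is random rather than trained, and the hash-bucket lookup only loosely mirrors the library's internal hashing scheme.

```python
import numpy as np

def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list[str]:
    """Boundary-marked character n-grams, fastText style."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(marked) - n + 1)]

print(char_ngrams("running", min_n=3, max_n=3))
# ['<ru', 'run', 'unn', 'nni', 'nin', 'ing', 'ng>']

# Toy n-gram table: hash each n-gram into a fixed number of buckets,
# each holding a vector. Random here; trained parameters in reality.
rng = np.random.default_rng(0)
NUM_BUCKETS, DIM = 100_000, 50
table = rng.normal(size=(NUM_BUCKETS, DIM)).astype(np.float32)

def word_vector(word: str) -> np.ndarray:
    """Word embedding = sum of its constituent n-gram vectors."""
    rows = [hash(g) % NUM_BUCKETS for g in char_ngrams(word)]
    return table[rows].sum(axis=0)

# Any string gets a vector, seen in training or not:
print(word_vector("runningly").shape)  # (50,)
```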
Example
# "running" = sum of embeddings for <ru, run, unn, nni, nin, ing, ng>
# Misspelled "runnig" shares n-grams with "running"
# Unseen word "sprinting" inherits meaning from "-ing" suffix
similarity("running", "sprint") = high due to shared n-grams
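This behavior can be reproduced with Gensim's FastText implementation. A short example follows, assuming Gensim 4.x; the toy corpus is made up and far too small for meaningful vectors:

```python
from gensim.models import FastText

# Tiny made-up corpus; real training needs far more text.
sentences = [
    ["the", "athlete", "is", "running", "fast"],
    ["she", "was", "sprinting", "down", "the", "track"],
    ["he", "likes", "running", "and", "jumping"],
]

model = FastText(sentences, vector_size=50, window=3,
                 min_count=1, min_n=3, max_n=6, epochs=50)

# "runnig" was never seen, yet it still gets a vector via its n-grams
print("runnig" in model.wv.key_to_index)         # False: it is OOV
print(model.wv.similarity("running", "runnig"))  # high: shared n-grams
```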
fastText also supports supervised text classification, using the same averaged bag-of-n-grams representation as input to a linear classifier; a sketch follows.
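With the official fasttext Python bindings, supervised training looks like this; the file name, labels, and hyperparameter values are hypothetical placeholders:

```python
import fasttext

# train.txt (hypothetical) holds one example per line, with labels
# prefixed by "__label__", e.g.:
#   __label__positive great movie , loved it
#   __label__negative total waste of time
model = fasttext.train_supervised("train.txt", epoch=25, wordNgrams=2)

# predict() returns the top label(s) and their probabilities
labels, probs = model.predict("what a fantastic film")
print(labels, probs)  # e.g. (('__label__positive',), array([0.97]))
```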
Variants and history
fastText was open-sourced by Facebook AI Research in 2016, with its reference papers published in 2016 and 2017, as a practical improvement on Word2Vec, particularly for under-resourced languages. Its n-gram approach was inspired by earlier work on morphology but applied at scale. The library became widely adopted for its efficiency, multilingual support (including pretrained vectors for over 150 languages), and robustness to spelling variation. Modern subword tokenizers (BPE, SentencePiece, WordPiece) in contextual models follow similar principles.
When to use it
Choose fastText when:
- Working with morphologically rich or agglutinative languages
- Handling noisy or misspelled text
- OOV words are common
- You need fast, lightweight embeddings
- Multilingual coverage is important
fastText is efficient and robust, but its embeddings are static: every occurrence of a word receives the same vector, so it cannot disambiguate word senses in context. For tasks requiring that kind of syntactic or semantic nuance, contextual models such as BERT are the better fit.