Subword
-
Tokeniser Vocabulary
Fixed set of subword units, either learned from a corpus or predefined, used for tokenisation; typically 32k–128k entries, balancing sequence compression against model size and flexibility.
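As a sketch of how a fixed vocabulary is applied at tokenisation time, the following greedy longest-match tokeniser (WordPiece-style) splits a word into the longest known subword units; the toy vocabulary and the "##" continuation prefix are illustrative assumptions, not any specific library's behaviour:

```python
# Greedy longest-match subword tokenisation against a fixed vocabulary.
# A minimal sketch; the "##" prefix marks word-internal continuation pieces.

def tokenise(word: str, vocab: set, unk: str = "[UNK]") -> list:
    """Split a word into the longest matching vocabulary entries, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate span until it matches a vocabulary entry.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the prefix
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return [unk]  # no entry matches even a single character
        start = end
    return pieces

vocab = {"token", "##iser", "##s", "un", "##break", "##able"}
print(tokenise("tokenisers", vocab))   # ['token', '##iser', '##s']
print(tokenise("unbreakable", vocab))  # ['un', '##break', '##able']
```

Because the vocabulary is fixed, any word outside it decomposes into smaller known pieces rather than mapping to an unknown token, which is the flexibility the size trade-off buys.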
-
fastText
Word embedding method that represents each word as the sum of its character n-gram vectors, letting it compose embeddings for out-of-vocabulary words and morphological variants; published by Bojanowski et al. in 2017.
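A minimal sketch of the core idea: hash each boundary-marked character n-gram into a fixed table of vectors and sum them. The n-gram range (3–6) mirrors the paper's default, but the bucket count, dimension, random initialisation, and use of Python's built-in hash are illustrative assumptions (the paper uses 2M buckets and an FNV-style hash, with vectors learned during training):

```python
# fastText-style composition: a word vector is the sum of its hashed
# character n-gram vectors, so unseen words still receive an embedding.

import numpy as np

BUCKETS, DIM = 100_000, 64  # toy sizes; fastText defaults to 2M buckets
rng = np.random.default_rng(0)
ngram_vectors = rng.normal(scale=0.1, size=(BUCKETS, DIM))  # stands in for trained vectors

def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> list:
    """Boundary-marked character n-grams, e.g. '<wh', 'whe', ..., 're>'."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

def word_vector(word: str) -> np.ndarray:
    """Sum the vectors of the word's n-grams, hashed into fixed buckets."""
    return sum(ngram_vectors[hash(g) % BUCKETS] for g in char_ngrams(word))

# An out-of-vocabulary word still composes a vector from its pieces:
print(word_vector("tokenisability").shape)  # (64,)
```

Sharing n-grams across words is what captures morphology: "tokenise" and "tokenisers" overlap in most of their n-grams, so their composed vectors land close together.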