Text Analysis
-
N-Gram
An n-gram is a contiguous sequence of n tokens drawn from a text, used to capture local word order for indexing, language modelling, and similarity.
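As a minimal illustration, n-grams can be extracted with a simple sliding window over the token list (the function name `ngrams` and the example sentence are illustrative, not from any particular library):

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences in `tokens`, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox".split()
print(ngrams(tokens, 2))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```

With n=1 this yields unigrams (the tokens themselves), n=2 bigrams, n=3 trigrams, and so on; a text shorter than n yields no n-grams.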
-
Tokenisation
Tokenisation is the process of splitting a raw text string into a sequence of discrete units — tokens — that downstream NLP components such as indexers, classifiers, and language models can operate on.
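A minimal regex-based sketch of such a tokeniser (one of many possible strategies; production systems typically use more elaborate rules or trained models):

```python
import re

def tokenise(text):
    """Split raw text into tokens: runs of word characters, or
    single punctuation marks (anything not a word char or whitespace)."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenise("Don't panic, it's fine."))
# ['Don', "'", 't', 'panic', ',', 'it', "'", 's', 'fine', '.']
```

Note how even this simple rule forces design decisions, e.g. the apostrophe in "Don't" is emitted as its own token here; other tokenisers keep contractions whole or split them into linguistically motivated pieces.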