Text Analysis
-
N-Gram
An n-gram is a contiguous sequence of n tokens drawn from a text, used to capture local word order for indexing, language modelling, and similarity.
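As a minimal illustration, n-grams can be extracted with a simple sliding window over the token list (the function name `ngrams` and the example sentence are illustrative, not from any particular library):

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences in `tokens`, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox".split()
print(ngrams(tokens, 2))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```

With n=1 this yields unigrams (the tokens themselves), n=2 bigrams, n=3 trigrams, and so on; a text shorter than n yields no n-grams.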
-
Tokenisation
Tokenisation is the process of splitting a raw text string into a sequence of discrete units — tokens — that downstream NLP components such as indexers, classifiers, and language models can operate on.
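A minimal regex-based sketch of such a tokeniser (one of many possible strategies; production systems typically use more elaborate rules or trained models):

```python
import re

def tokenise(text):
    """Split raw text into tokens: runs of word characters, or
    single punctuation marks (anything not a word char or whitespace)."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenise("Don't panic, it's fine."))
# ['Don', "'", 't', 'panic', ',', 'it', "'", 's', 'fine', '.']
```

Note how even this simple rule forces design decisions, e.g. the apostrophe in "Don't" is emitted as its own token here; other tokenisers keep contractions whole or split them into linguistically motivated pieces.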