Morphological Analysis
What it is
Morphology is the branch of linguistics concerned with the internal structure of words. Morphological analysis is the computational process of decomposing a word into its constituent morphemes — the smallest units of a language that carry meaning or grammatical function.
A morpheme is not the same as a syllable. “Unhappiness” has three syllables but also three morphemes: the prefix un-, the free morpheme happy, and the suffix -ness. Each morpheme contributes something distinct: negation, a base meaning, and nominalisation respectively.
Morphemes fall into two broad classes:
- Free morphemes stand alone as complete words: dog, run, happy.
- Bound morphemes must attach to another form: -ing, un-, -ness, -ed.
Within bound morphemes, a further distinction matters for NLP:
| Type | Function | Example |
|---|---|---|
| Prefix | Precedes the stem | un-happy, re-write, pre-process |
| Suffix | Follows the stem | walk-ing, kind-ness, nation-al |
| Inflectional | Marks grammatical properties (tense, number, case) without changing word class | dog → dogs, walk → walked |
| Derivational | Creates a new word with related but distinct meaning, often changing word class | happy → happiness, quick → quickly |
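A toy segmenter makes the decomposition concrete. The affix tables below are illustrative stand-ins for a real lexicon, and the sketch deliberately ignores spelling changes: peeling -ness off "unhappiness" leaves the surface stem "happi", not the dictionary form "happy", which a real analyser would repair with orthographic rules.

```python
# Toy morpheme segmenter. The affix inventories are illustrative,
# not a real lexicon; a production analyser also needs spelling rules
# (e.g. mapping the surface stem "happi" back to "happy").
PREFIXES = {"un": "negation", "re": "repetition", "pre": "before"}
SUFFIXES = {"ness": "nominaliser (derivational)",
            "ing": "participle (inflectional)",
            "ed": "past tense (inflectional)",
            "s": "plural (inflectional)"}

def segment(word):
    parts = []
    # longest-match prefix
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p):
            parts.append((p + "-", "prefix"))
            word = word[len(p):]
            break
    # longest-match suffix, leaving a stem of at least three letters
    suffix = None
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) > len(s) + 2:
            suffix = ("-" + s, "suffix")
            word = word[:-len(s)]
            break
    parts.append((word, "stem"))
    if suffix:
        parts.append(suffix)
    return parts

print(segment("unhappiness"))
# [('un-', 'prefix'), ('happi', 'stem'), ('-ness', 'suffix')]
```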
How it works
Morphological analysis maps a surface form (the word as it appears in text) to a structured description of its parts. In the classical formalism this description includes at minimum a lexical stem and a list of affixes or features attached to it.
[illustrate: decomposition tree for “unhappiness” — root node labelled “unhappiness”, branching left to prefix “un-” and right to “happiness”, which itself branches to stem “happy” and suffix “-ness”; each node labelled with morpheme type]
The two morphological processes most relevant to NLP are:
Inflection produces grammatical variants of the same lexeme. All of walk, walks, walked, walking are inflectional forms of the same underlying word. They share a lemma and, crucially for IR, a meaning. An inflectional morpheme never changes the part of speech.
Derivation produces new — though related — lexemes. Happiness is derived from happy, but it is a distinct word with a distinct dictionary entry. Derivational morphemes frequently change the part of speech (adjective → noun here) and may shift meaning substantially.
[illustrate: two-column before/after table — left column shows a raw token list (“walked”, “dogs”, “unhappiness”, “preprocessing”); right column shows the morpheme parse for each, labelling stem, prefix, and suffix segments with colour coding]
The distinction matters because inflection is almost always worth collapsing (walking = walk for retrieval purposes), whereas derivation requires more care — unhappiness and happy are related but not interchangeable.
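The asymmetry can be captured with a simple lemma table. The table here is hand-written for illustration; a real system would consult a lexicon or morphological analyser.

```python
# Collapse inflection, preserve derivation. LEMMAS is a toy table.
LEMMAS = {
    # inflectional variants share one lemma ...
    "walk": "walk", "walks": "walk", "walked": "walk", "walking": "walk",
    # ... but derivationally related words keep distinct entries
    "happy": "happy", "happiness": "happiness", "unhappiness": "unhappiness",
}

def normalise(tokens):
    return [LEMMAS.get(t, t) for t in tokens]

print(normalise(["walking", "walked", "happiness"]))
# ['walk', 'walk', 'happiness']
```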
Example
Take the surface form “preprocessing”:
- Identify the prefix: pre- (meaning “before”)
- Identify the stem: process
- Identify the suffix: -ing (inflectional; marks present participle / gerund)
Morphological parse: pre- + process + -ing
Lemma: process (verb)
For retrieval, “preprocessing”, “preprocessed”, and “preprocesses” all reduce to the lemma process, so a query for process will match documents containing any of those forms.
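A minimal inverted index over lemmas shows the effect on retrieval; the lemma table is again a hand-built stand-in for a real lemmatiser.

```python
# Index documents by lemma so inflectional variants of "process" match.
# LEMMA is a toy table, not a real lemmatiser.
LEMMA = {"preprocessing": "process", "preprocessed": "process",
         "preprocesses": "process"}

docs = {1: ["we", "discuss", "preprocessing"],
        2: ["the", "data", "was", "preprocessed"]}

index = {}
for doc_id, tokens in docs.items():
    for tok in tokens:
        index.setdefault(LEMMA.get(tok, tok), set()).add(doc_id)

print(sorted(index["process"]))  # [1, 2] -- the query matches both documents
```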
Now consider the morphologically richer Turkish word “evlerinizden” (“from your houses”):
ev (house) +ler (plural) +iniz (your) +den (from)
A single surface token encodes four morphemes. English-centric approaches that treat tokens as atomic units fail entirely on input like this.
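For this one word, the parse can be sketched by peeling suffixes right to left. The suffix list and glosses below are hand-written for exactly this example; a real Turkish analyser must handle vowel harmony and a far larger suffix inventory.

```python
# Right-to-left suffix peeling for "evlerinizden" only; the glosses
# are hand-written for this example, not a general Turkish analyser.
SUFFIXES = [("den", "ablative: from"),
            ("iniz", "possessive: your"),
            ("ler", "plural")]

def analyse(word):
    morphs = []
    for suffix, gloss in SUFFIXES:  # outermost suffix first
        if word.endswith(suffix):
            morphs.append((suffix, gloss))
            word = word[:-len(suffix)]
    return [(word, "stem")] + morphs[::-1]

print(analyse("evlerinizden"))
# [('ev', 'stem'), ('ler', 'plural'),
#  ('iniz', 'possessive: your'), ('den', 'ablative: from')]
```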
Variants and history
Morphological analysis has been studied formally since the 1960s, with the finite-state transducer (FST) emerging as the dominant formalism through the work of Kaplan and Kay at Xerox PARC in the 1980s. An FST is a bidirectional automaton: given a lexicon and a set of morphophonological rules, it can generate all surface forms of a word, or analyse a surface form back to its lexical entry and feature bundle. Implementations such as XFST, foma, and HFST (Helsinki Finite-State Technology) remain in active use for resource-rich languages.
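The defining property of an FST, running in both directions over a single rule set, can be mimicked with a paired lookup table. This is only a cartoon of the idea; a real transducer composes lexicon and rule automata rather than enumerating forms.

```python
# One table, two directions: lexical form <-> surface form.
# (A dict pair stands in for the transducer; purely illustrative.)
PAIRS = [("walk+V+Past", "walked"),
         ("walk+V+Prog", "walking"),
         ("walk+V+3sg", "walks")]

GENERATE = dict(PAIRS)              # lexical -> surface (generation)
ANALYSE = {s: l for l, s in PAIRS}  # surface -> lexical (analysis)

print(GENERATE["walk+V+Past"])  # walked
print(ANALYSE["walking"])       # walk+V+Prog
```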
Three broad approaches now coexist in NLP:
1. Rule-based (finite-state / dictionary-backed) Systems such as Hunspell encode affix rules and a root lexicon explicitly. Analysis is precise and linguistically motivated, but coverage depends entirely on the lexicon — unknown words and proper nouns are handled poorly.
2. Algorithmic / heuristic Stemmers such as Porter and its Snowball successor Porter2 apply suffix-stripping rules without consulting a dictionary. They are language-agnostic in principle, fast, and cover arbitrary vocabulary, but may produce non-word stems ("poni" from "ponies") and conflate words that should remain distinct ("universe" and "university" both reduce to "univers").
3. Statistical and neural Unsupervised segmentation models such as Morfessor learn morpheme boundaries from raw text by minimising description length. Subword tokenisers — BPE (Byte Pair Encoding) and SentencePiece — discover frequent character sequences rather than linguistically principled morphemes, but achieve similar vocabulary normalisation effects and have become the default in transformer-based pipelines. These approaches handle morphologically rich and agglutinative languages far better than rule lists designed around English.
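The core BPE loop is small enough to sketch in full: count adjacent symbol pairs, merge the most frequent pair into a new symbol, repeat. The tiny corpus below is illustrative only.

```python
from collections import Counter

def most_frequent_pair(vocab):
    """Count adjacent symbol pairs across the (word -> frequency) vocab."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def apply_merge(pair, vocab):
    """Rewrite every word, fusing the chosen pair into one symbol."""
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq
            for word, freq in vocab.items()}

# Words as space-separated symbols with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

merges = []
for _ in range(3):
    pair = most_frequent_pair(vocab)
    merges.append(pair)
    vocab = apply_merge(pair, vocab)

print(merges)  # three learned merges; "est" emerges as a frequent subword
```

After three merges the frequent suffix "est" has fused into a single symbol, even though no linguistic knowledge went in: frequency alone recovered something morpheme-like.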
When to use it
Why it matters for IR and NLP
Without morphological normalisation, a vocabulary explodes. A corpus containing run, runs, ran, running, runner, and runners may treat these as six distinct types. Downstream models — whether a sparse inverted index or a dense embedding space — must then represent and generalise across all six separately. Morphological analysis collapses them, improving recall and reducing model complexity.
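The collapse is easy to quantify on that toy vocabulary; the normalisation table below is hand-written for illustration.

```python
tokens = ["run", "runs", "ran", "running", "runner", "runners"]

# Toy normalisation table: inflection collapses to one lemma, while
# the derived noun "runner" keeps its own entry.
STEM = {"runs": "run", "ran": "run", "running": "run", "runners": "runner"}

types_before = len(set(tokens))
types_after = len({STEM.get(t, t) for t in tokens})
print(types_before, types_after)  # 6 2
```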
The appropriate depth of analysis depends on the task:
| Goal | Approach |
|---|---|
| Broad recall in a search index | Stemming (fast, no dictionary required) |
| Grammatically accurate normalisation | Lemmatisation backed by a morphological analyser |
| Subword vocabulary for a neural model | BPE or SentencePiece |
| Morphologically rich language (Turkish, Finnish) | FST-based analyser or neural segmentation |
| Precise inflection generation (NLG) | Rule-based FST — analysis and generation are both needed |
Language typology matters. English is a mildly fusional language with relatively few inflectional categories; a simple suffix-stripping stemmer handles most cases adequately. Finnish and Turkish are agglutinative — morphemes stack cleanly and a single word can express what English requires a full phrase to say. Mandarin and other isolating languages have minimal morphology; token-level analysis is largely sufficient. Choosing a morphological strategy without accounting for the target language is a common source of silent failures in multilingual systems.
Stemming and lemmatisation are both downstream applications of morphological analysis:
- Stemming approximates the stem, often by mechanical suffix removal. The result may not be a real word. It is fast and requires no lexical resource.
- Lemmatisation returns the canonical dictionary form (the lemma). It requires part-of-speech context and a lexicon, but its output is always a valid word — preferable when interpretability or precision matters.
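The trade-off is visible side by side. The crude suffix stripper below stands in for a stemmer and the toy table for a lemmatiser's lexicon; both are sketches, not real implementations.

```python
def crude_stem(word):
    """Mechanical suffix removal: fast, no lexicon, may emit non-words."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Toy lexicon standing in for a dictionary-backed lemmatiser.
LEMMA = {"ponies": "pony", "ran": "run", "running": "run"}

def lemmatise(word):
    return LEMMA.get(word, word)

for w in ("ponies", "ran", "running"):
    print(w, "->", crude_stem(w), "|", lemmatise(w))
# ponies -> poni | pony   (the stem is not a word)
# ran -> ran | run        (the stemmer misses the irregular form)
# running -> runn | run
```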