Inflection

Morphological-Analysis Lemmatisation Stemming Preprocessing Linguistics Needs-Review

What it is

Inflection is the process by which a single word takes on different surface forms to express grammatical information, without changing its core meaning or creating a new word. The verb walk inflects into walks, walked, and walking depending on tense and agreement; the noun dog inflects into dogs for the plural. Every one of these forms belongs to the same lexeme — the abstract dictionary entry — and would be listed under a single headword in any dictionary.

This distinguishes inflection from derivation. Derivation creates a new lexeme with a shifted meaning: happy → happiness moves from adjective to noun and introduces a new concept. Inflection never does this. Walked is still walk; it just encodes past tense.

For NLP systems, inflection is a major source of vocabulary fragmentation: the same underlying concept appears as multiple distinct strings, and a system that treats them as unrelated terms will silently miss relevant documents or dilute training signal.

How it works

Inflection is governed by grammatical categories — abstract dimensions of meaning that a language encodes obligatorily in word form. The categories relevant to English NLP are:

Number (nouns): Singular and plural are distinct inflectional forms of the same noun lexeme.

dog → dogs
analysis → analyses
mouse → mice (irregular)

Tense and aspect (verbs): English verbs inflect for past tense and have two participles, which combine with auxiliaries to form the perfect and progressive aspects.

walk → walked (simple past / past participle)
walk → walking (present participle)

Person and number agreement (verbs): Third-person singular present requires a suffix in English, even though other persons do not.

I walk / you walk / she walks

Comparative and superlative (adjectives and adverbs): Degree is expressed inflectionally for short adjectives; longer ones use more and most instead.

fast → faster → fastest
bad → worse → worst (suppletive irregular)

The complete set of inflected forms for a lexeme is its paradigm. A regular English verb paradigm is small: four distinct forms, or five for irregular verbs such as go that distinguish the past tense from the past participle. A Finnish noun, by contrast, inflects for fifteen grammatical cases in both singular and plural, giving roughly thirty basic forms before possessive suffixes and clitics are counted; a Russian verb paradigm runs to dozens of forms. English is morphologically lean by world-language standards, which is one reason rule-based tools developed on English often transfer poorly to other languages.

[illustrate: paradigm table for the lexeme WALK, with five cells (base, 3sg present, past, past participle, present participle) filled by four distinct forms (walk, walks, walked, walked, walking), and a parallel column showing the irregular lexeme GO (go, goes, went, gone, going) to contrast regular and suppletive patterns]
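A paradigm can be pictured as a small table keyed by grammatical cell. The sketch below is only an illustration: the cell names, the `PARADIGMS` dictionary, and both helper functions are invented for this example and do not come from any particular library.

```python
# Paradigm cells for an English verb, in a fixed order (names are illustrative).
CELLS = ("base", "3sg_present", "past", "past_participle", "present_participle")

# Hand-written paradigms for two lexemes, one regular and one suppletive.
PARADIGMS = {
    "WALK": dict(zip(CELLS, ("walk", "walks", "walked", "walked", "walking"))),
    "GO": dict(zip(CELLS, ("go", "goes", "went", "gone", "going"))),
}


def distinct_forms(lexeme):
    """Return the distinct surface forms of a lexeme, in paradigm order."""
    seen = []
    for cell in CELLS:
        form = PARADIGMS[lexeme][cell]
        if form not in seen:
            seen.append(form)
    return seen


def build_form_index(paradigms):
    """Invert the paradigms: map each surface form to the (lexeme, cell) pairs it fills."""
    index = {}
    for lexeme, cells in paradigms.items():
        for cell, form in cells.items():
            index.setdefault(form, []).append((lexeme, cell))
    return index


FORM_INDEX = build_form_index(PARADIGMS)

print(distinct_forms("WALK"))  # ['walk', 'walks', 'walked', 'walking']
print(FORM_INDEX["walked"])    # [('WALK', 'past'), ('WALK', 'past_participle')]
print(FORM_INDEX["went"])      # [('GO', 'past')]
```

Inverting the table is what turns a paradigm into a lookup lemmatiser: the surface form walked maps back to WALK, and the two cells it fills show why a single string can carry more than one grammatical reading.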

Regular vs irregular inflection

Regular inflection follows productive suffix rules: add -ed for past tense, -s for third-person singular, -er for comparative. A suffix-stripping rule can undo these mechanically, provided it also reverses spelling adjustments such as tried → try.
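To make "productive suffix rules" concrete, here is a minimal sketch that generates two regular verb forms from a base. The function names are hypothetical, and the spelling adjustments covered are a deliberately partial selection: consonant doubling (stop → stopped) is not handled, and irregular verbs will come out wrong.

```python
VOWELS = "aeiou"


def regular_past(base):
    """Past tense of a regular verb, with two common spelling adjustments."""
    if base.endswith("e"):
        return base + "d"                      # bake -> baked
    if base.endswith("y") and len(base) > 1 and base[-2] not in VOWELS:
        return base[:-1] + "ied"               # try -> tried
    return base + "ed"                         # walk -> walked


def regular_third_singular(base):
    """Third-person singular present of a regular verb."""
    if base.endswith(("s", "x", "z", "ch", "sh")):
        return base + "es"                     # watch -> watches
    if base.endswith("y") and len(base) > 1 and base[-2] not in VOWELS:
        return base[:-1] + "ies"               # try -> tries
    return base + "s"                          # walk -> walks


for verb in ("walk", "bake", "try", "play"):
    print(verb, regular_past(verb), regular_third_singular(verb))
```

Even this small generator needs orthographic special cases, which is why the reverse operation is rarely a matter of simply deleting a suffix.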

Irregular inflection breaks those rules. Forms may be produced by:

Internal vowel change (ablaut): run → ran, sing → sang, drink → drank
Suppletion — historically unrelated forms merged into one paradigm: go → went, be → am / is / are / was / were
Zero inflection — plural identical to singular: sheep, fish, deer

Irregular forms cannot be recovered by suffix stripping. A rule that strips -ed from walked to yield walk will fail on ran entirely — there is no suffix to strip. Rule-based systems must supplement their suffix rules with a lookup table of irregular forms. Machine-learning approaches learn these mappings from annotated data.

[illustrate: side-by-side before/after showing a suffix-stripping rule applied to three verb past-tense forms — “walked” → “walk” (succeeds), “talked” → “talk” (succeeds), “ran” → “ran” (fails, no suffix removed) — annotated with “regular” and “irregular” labels]
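The combination described above, a suffix rule backed by an exception table, can be sketched in a few lines. Everything here is an assumption made for illustration: the `IRREGULAR_PAST` table holds only four entries, and the `-ed` rule is intentionally naive.

```python
# Tiny, hand-picked exception table; a real one would hold a few hundred entries.
IRREGULAR_PAST = {"ran": "run", "sang": "sing", "drank": "drink", "went": "go"}


def strip_ed(token):
    """Naive rule: remove a final -ed from tokens longer than four characters."""
    if token.endswith("ed") and len(token) > 4:
        return token[:-2]
    return token


def normalise_past(token):
    """Consult the irregular table first, then fall back to the suffix rule."""
    return IRREGULAR_PAST.get(token, strip_ed(token))


for form in ("walked", "talked", "ran"):
    print(form, "->", strip_ed(form), "|", normalise_past(form))
# walked -> walk | walk
# talked -> talk | talk
# ran -> ran | run
```

The suffix rule alone leaves ran untouched; only the lookup recovers run. The rule is also too blunt for regular verbs with spelling changes: it would reduce baked to bak, which is one reason production stemmers carry many more conditions.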

Example

Consider a user querying a document collection for information about dogs barking. The lexemes DOG and BARK can surface in the query and in documents as any of these strings:

Surface form	Grammatical category
dog	singular noun (lemma)
dogs	plural noun
bark	verb base / lemma
barks	third-person singular present
barked	simple past
barking	present participle

Without inflection normalisation, a system indexing on raw tokens treats each row as a distinct type. A query for dogs bark will miss documents containing the dog barked even though both express the same underlying predication. Normalising all forms to their lemmas — dog and bark — collapses the paradigm and restores retrieval equivalence.
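The retrieval failure and its fix can be reproduced with a toy matcher. This is a sketch under simplifying assumptions: the `LEMMAS` table is hand-written from the rows above, tokenisation is a plain whitespace split, and a document "matches" when it contains every query term.

```python
# Hand-written lemma table covering only the forms in the example.
LEMMAS = {
    "dog": "dog", "dogs": "dog",
    "bark": "bark", "barks": "bark", "barked": "bark", "barking": "bark",
}


def raw_terms(text):
    """Index terms with no normalisation beyond lower-casing."""
    return set(text.lower().split())


def lemma_terms(text):
    """Index terms after mapping each token to its lemma where one is known."""
    return {LEMMAS.get(tok, tok) for tok in raw_terms(text)}


def matches(query, document, terms):
    """True if every query term occurs in the document under the given term function."""
    return terms(query) <= terms(document)


query = "dogs bark"
document = "the dog barked"

print(matches(query, document, raw_terms))    # False
print(matches(query, document, lemma_terms))  # True
```

Note that the same normalisation has to be applied at indexing time and at query time; lemmatising only one side reintroduces the mismatch.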

Variants and history

The formal study of inflection belongs to morphology, the branch of linguistics concerned with word structure. The distinction between inflection and derivation was codified in nineteenth-century comparative philology and remains a cornerstone of linguistic theory.

Computationally, inflection has been addressed at successive levels of sophistication:

Suffix-stripping stemmers (Porter, 1980; Lovins, 1968) approximate inflection removal without linguistic structure, trading accuracy for speed
Dictionary lookup approaches store each stem together with the affix rules it accepts, the strategy used in Hunspell’s dictionary and affix files
Finite-state morphology (Koskenniemi, 1983; the two-level model) models inflection as correspondence rules between lexical and surface forms, compiled into finite-state transducers that support both analysis and generation
Neural morphological analysers learn paradigms from annotated corpora and can reproduce irregular forms attested in their training data without a hand-built exception list
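As a rough picture of the dictionary-based approach, and of what it means for one resource to serve both analysis and generation, the sketch below pairs stems with affix classes. It is loosely inspired by the stem-plus-affix-rule idea and does not reproduce Hunspell's file format or any real tool's API; the class names `V` and `N`, the lexicon, and both functions are invented for illustration.

```python
# Each affix class lists (cell, suffix) pairs. Class names are made up for this sketch.
AFFIX_CLASSES = {
    "V": (("3sg_present", "s"), ("past", "ed"), ("present_participle", "ing")),
    "N": (("plural", "s"),),
}

# Lexicon: stem -> affix classes it accepts.
LEXICON = {"walk": ("V", "N"), "bark": ("V", "N"), "dog": ("N",)}


def generate(stem):
    """Generation: list every (form, label) pair licensed for a stem."""
    forms = [(stem, "base")]
    for cls in LEXICON[stem]:
        for cell, suffix in AFFIX_CLASSES[cls]:
            forms.append((stem + suffix, f"{cls}:{cell}"))
    return forms


def analyse(form):
    """Analysis: find every (stem, label) reading of a surface form."""
    readings = []
    for stem in LEXICON:
        for candidate, label in generate(stem):
            if candidate == form:
                readings.append((stem, label))
    return readings


print(analyse("barks"))  # [('bark', 'V:3sg_present'), ('bark', 'N:plural')]
print(analyse("dogs"))   # [('dog', 'N:plural')]
```

The two readings for barks show that analysis is often ambiguous without context: the same string is a verb form and a plural noun, and choosing between them usually requires part-of-speech information.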

Linguistic typology distinguishes languages by how heavily they rely on inflection. Analytic languages (Mandarin, Vietnamese) express grammatical categories through word order and particles rather than inflection. Synthetic languages encode them in affixes, and fall into two broad subtypes. Agglutinative languages (Turkish, Finnish, Hungarian) stack morphemes transparently, each marking one category. Fusional languages (Latin, Russian) fuse multiple categories into a single affix that cannot be cleanly segmented. NLP tools built for English often struggle with both subtypes because their inventories of inflected forms are far larger.

When to use it

Understanding inflection as a concept matters whenever you are designing a text normalisation pipeline:

If your pipeline uses a stemmer, be aware that stemmers approximate inflection removal but may also strip derivational suffixes, conflating semantically distinct terms (organisation and organ might collapse to the same stem). Stemmers are fast but imprecise.
If your pipeline uses a lemmatiser, inflection is precisely what it is targeting. Lemmatisers return the citation form, that is, the dictionary headword: the infinitive for verbs (walk, not walked), the singular nominative for nouns (dog, not dogs). Lemmatisation requires a lexicon, a morphological model, or both.
For morphologically rich languages — Arabic, Russian, Finnish, Turkish — inflection is far more consequential to retrieval quality than it is for English. A dedicated morphological analyser is usually necessary; an English-style stemmer will not suffice.
For named entities and domain-specific vocabulary, be cautious: inflection normalisation can corrupt proper nouns and technical terms that happen to look like inflected forms.
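One simple safeguard for the last point is a protected-term list consulted before any normalisation. The sketch below assumes a hand-maintained, case-sensitive `PROTECTED` set and a toy `LEMMAS` table; in a real pipeline the protected terms might come from a named-entity recogniser or a domain glossary instead.

```python
# Toy lemma table and protected-term list, both hypothetical.
LEMMAS = {"windows": "window", "apples": "apple", "dogs": "dog"}
PROTECTED = {"Windows", "Apple"}  # case-sensitive: matches the capitalised names only


def normalise(token):
    """Leave protected terms untouched; lower-case and lemmatise everything else."""
    if token in PROTECTED:
        return token
    lowered = token.lower()
    return LEMMAS.get(lowered, lowered)


tokens = "Windows dogs windows".split()
print([normalise(t) for t in tokens])  # ['Windows', 'dog', 'window']
```

Checking the protected list before lower-casing matters: once case is discarded, the product name and the plural noun become indistinguishable.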