Longest Common Substring

What it is

The longest common substring (LCS) is the longest sequence of consecutive characters appearing identically in both strings. Unlike longest common subsequence, characters must be contiguous.

How it works

Dynamic programming approach:

Create matrix where rows are characters of string A, columns are string B
Initialise first row and column with 0
For each cell (i, j):
- If A[i-1] == B[j-1]: dp[i][j] = dp[i-1][j-1] + 1
- Else: dp[i][j] = 0
Track maximum value encountered; this is LCS length
Backtrack from max cell to reconstruct the substring

Time: O(mn), space: O(mn). Optimisable to O(min(m, n)) space with careful iteration.

[illustrate: DP matrix for “algorithm” and “logarithm” showing the longest common substring “algorithm” with cell values]

Example

LCS(“algorithm”, “logarithm”):

Longest common substring: “algorithm” (length 9) appears in both

LCS(“kitten”, “sitting”):

Common substrings: “itt” (length 3), “i”, “t”, etc.
Longest: “itt”

Variants and history

Used in plagiarism detection, DNA sequence analysis, and document comparison. Suffix arrays and suffix trees enable more efficient computation. Related to n-gram overlap metrics.

When to use it

Plagiarism detection, near-duplicate document detection, and code similarity. More restrictive than subsequence matching (contiguity matters). Useful for finding exact textual reuse. Less suitable for fuzzy matching or typo tolerance.