Longest Common Substring
What it is
The longest common substring (LCS) is the longest sequence of consecutive characters appearing identically in both strings. Unlike longest common subsequence, characters must be contiguous.
How it works
Dynamic programming approach:
- Create matrix where rows are characters of string A, columns are string B
- Initialise first row and column with 0
- For each cell (i, j):
- If A[i-1] == B[j-1]: dp[i][j] = dp[i-1][j-1] + 1
- Else: dp[i][j] = 0
- Track maximum value encountered; this is LCS length
- Backtrack from max cell to reconstruct the substring
Time: O(mn), space: O(mn). Optimisable to O(min(m, n)) space with careful iteration.
[illustrate: DP matrix for “algorithm” and “logarithm” showing the longest common substring “algorithm” with cell values]
Example
LCS(“algorithm”, “logarithm”):
- Longest common substring: “algorithm” (length 9) appears in both
LCS(“kitten”, “sitting”):
- Common substrings: “itt” (length 3), “i”, “t”, etc.
- Longest: “itt”
Variants and history
Used in plagiarism detection, DNA sequence analysis, and document comparison. Suffix arrays and suffix trees enable more efficient computation. Related to n-gram overlap metrics.
When to use it
Plagiarism detection, near-duplicate document detection, and code similarity. More restrictive than subsequence matching (contiguity matters). Useful for finding exact textual reuse. Less suitable for fuzzy matching or typo tolerance.