Longest Common Subsequence

What it is

The longest common subsequence (LCS) is the longest sequence of characters that appear in the same order in both strings, though not necessarily consecutively. LCS length is a measure of similarity; LCS itself is useful for diff and merging operations.

How it works

Dynamic programming approach:

  1. Create matrix where rows are characters of string A, columns are string B
  2. Initialise first row and column with 0
  3. For each cell (i, j):
    • If A[i-1] == B[j-1]: dp[i][j] = dp[i-1][j-1] + 1
    • Else: dp[i][j] = max(dp[i-1][j], dp[i][j-1])
  4. Bottom-right cell contains LCS length
  5. Backtrack to reconstruct the LCS string

Time: O(mn), space: O(mn) for strings of length m and n.

Example

LCS(“algorithm”, “altruism”):

  • LCS: “altrim” (length 6)
  • One alignment: a-l-g-o-r-i-t-h-m vs a-l-t-r-u-i-s-m
  • Common subsequence: a, l, r, i, m (in order)

Variants and history

Foundation for diff algorithms (Unix diff, git diff) showing code changes. Related to edit distance but captures order-preserving similarity. Generalises to multiple sequences (LCS of multiple strings). Weighted variants for penalty-aware alignment (Smith-Waterman).

When to use it

Diff and patch generation, sequence comparison, similarity quantification. More appropriate than edit distance when order matters but contiguity doesn’t. Not suitable for short-string typo detection. Standard in bioinformatics for sequence analysis.

See also