Bloom Filter

What it is

A Bloom filter is a space-efficient probabilistic data structure for testing set membership. It answers “Is x in the set?” with:

  • True if likely (may have false positives)
  • False if definitely not (no false negatives)

Bloom filters use k hash functions and a bit array; space complexity O(n) bits (vs. O(n log n) for exact set storage). False positive rate tunable via filter size and hash count.

[illustrate: Bit array with k hash functions; insertion setting bits; membership test checking all k bits]

How it works

  1. Initialization: Create bit array of size m, initialize to 0

  2. Insertion:

    • Hash item with k independent hash functions
    • Set k bits in array to 1
  3. Membership test:

    • Hash item with k hash functions
    • If all k bits are 1: probably in set (may be false positive)
    • If any bit is 0: definitely not in set (no false negatives)
  4. Parameters:

    • m: bit array size
    • k: number of hash functions
    • n: expected number of items
    • False positive rate: ≈ (1 - e^(-kn/m))^k ≈ 2^(-k) for optimal k

Example

Set: {"apple", "banana", "cherry"}
m = 10 bits, k = 3 hash functions

Insertion "apple":
  hash1("apple") = 2 → set bit 2
  hash2("apple") = 5 → set bit 5
  hash3("apple") = 8 → set bit 8
  Array: [0,0,1,0,0,1,0,0,1,0]

Query "apple":
  hash1, hash2, hash3 → bits 2, 5, 8 → all 1 ✓
  Result: "probably in set"

Query "grape":
  hash1("grape") = 1 → bit 1 is 0
  Result: "definitely not in set"

Query "date":
  hash1, hash2, hash3 → bits 3, 6, 7 → all 1
  Result: "probably in set" (false positive if not actually inserted)

Variants and history

Bloom filters introduced by Burton Bloom (1970). Counting Bloom filters enable deletion. Spectral Bloom filters store frequencies. Scalable Bloom filters grow dynamically. Used in:

  • Web caches (check before fetching)
  • Spell-checking (known words)
  • Duplicate detection
  • Network routers (fast membership test)

Standard in distributed systems and databases.

When to use it

Use Bloom filters when:

  • Space is critical (millions of items)
  • False positives are acceptable (e.g., check before expensive operation)
  • No need for deletion or deletions rare
  • Membership query speed matters
  • Lightweight encoding of set information

Bloom filters sacrifice accuracy for space. Trade-off: false positive rate vs. space-time. Not suitable if false positives are unacceptable.

See also