Binary Embeddings

What it is

Binary embeddings compress float32 or float16 embedding vectors to 1 bit per dimension by thresholding each value at zero (positive → 1, zero or negative → 0). Similarity between binary vectors is computed via Hamming distance, the number of differing bits, using an XOR followed by a CPU-native population count (POPCNT on x86). This gives 32x compression over float32 (16x over float16) and retrieval that can run entirely on CPU, without a GPU and often without a dedicated vector-search library such as FAISS. The trade-off is a recall penalty of roughly 5–20%, depending on how the embeddings are trained.

[illustrate: float32 vector → sign(x) → binary vector; Hamming distance via POPCNT; comparison to dot-product similarity]
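
One way to see why Hamming distance can stand in for dot-product similarity: if each bit is read as a sign (+1 or -1), the dot product of two d-dimensional sign vectors equals d minus twice their Hamming distance. The sketch below checks that identity on random data. The function names and the random vectors are illustrative assumptions, not part of any library.

```python
import numpy as np

def sign_dot(x, y):
    """Dot product after mapping each value to +1 (x > 0) or -1 (otherwise)."""
    sx = np.where(x > 0, 1, -1)
    sy = np.where(y > 0, 1, -1)
    return int(sx @ sy)

def hamming_bits(x, y):
    """Number of dimensions where the thresholded bits differ."""
    return int(np.count_nonzero((x > 0) != (y > 0)))

rng = np.random.default_rng(0)
d = 768
x = rng.standard_normal(d).astype(np.float32)
y = rng.standard_normal(d).astype(np.float32)

# Identity: <sign(x), sign(y)> = d - 2 * Hamming(x, y)
assert sign_dot(x, y) == d - 2 * hamming_bits(x, y)
```

Because the two quantities are related by a fixed affine map, ranking by ascending Hamming distance gives the same order as ranking by descending dot product of the sign vectors.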

How it works

Quantization

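import numpy as np

def to_binary(embedding):
    # float32 embedding: [-0.3, 0.8, -0.1, 0.6, ...]
    # Bits:              [  0,   1,   0,   1, ...]
    # Threshold at zero, then pack 8 bits per byte (768 dims -> 96 bytes)
    return np.packbits(np.asarray(embedding) > 0, axis=-1)

def hamming_distance(a, b):
    # a, b: packed uint8 arrays from to_binary()
    # XOR marks the differing bits; unpackbits + sum counts them (popcount)
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())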

Retrieval with binary index

  1. Pre-encode all documents → binary vectors
  2. Pack 8 bits per byte: 768-dim embedding → 96 bytes (vs. 3072 bytes float32)
  3. At query time: binarize query, Hamming search over index
  4. Optional: re-score top-k with full float embeddings (two-stage)
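
The four steps above can be sketched with NumPy alone. This is a minimal brute-force illustration, not a tuned implementation: the function names, the `oversample` parameter (how many extra candidates the Hamming stage passes to re-scoring), and the random stand-in data are all assumptions made for the example.

```python
import numpy as np

# Popcount lookup table: number of set bits for each possible byte value.
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint16)

def binarize(embeddings):
    """Threshold at zero and pack 8 bits per byte along the last axis."""
    return np.packbits(np.asarray(embeddings) > 0, axis=-1)

def hamming_search(query_bits, index_bits, k):
    """Return indices of the k rows of index_bits closest to query_bits."""
    distances = POPCOUNT[np.bitwise_xor(index_bits, query_bits)].sum(axis=1)
    k = min(k, len(distances))
    return np.argsort(distances, kind="stable")[:k]

def two_stage_search(query, doc_floats, doc_bits, k=10, oversample=4):
    """Hamming search for k * oversample candidates, then float re-scoring."""
    candidates = hamming_search(binarize(query), doc_bits, k * oversample)
    scores = doc_floats[candidates] @ query          # full-precision dot product
    order = np.argsort(-scores, kind="stable")[:k]
    return candidates[order]

rng = np.random.default_rng(0)
docs = rng.standard_normal((1000, 768)).astype(np.float32)  # stand-in corpus
doc_bits = binarize(docs)                                   # shape (1000, 96), dtype uint8
query = docs[42] + 0.1 * rng.standard_normal(768).astype(np.float32)
top = two_stage_search(query, docs, doc_bits, k=5)
```

The re-scoring stage only touches `k * oversample` float vectors, so the full-precision embeddings can live on disk or in slower storage while the packed index stays in RAM.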

Performance characteristics

Format (768-dim)   Bytes/embedding   Size vs. float32   Notes
float32            3,072             1x                 dot product, GPU optimal
int8                 768             4x smaller         fast CPU
binary                96             32x smaller        POPCNT CPU
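
To make the per-embedding figures concrete at corpus scale, here is a small back-of-the-envelope calculation. The corpus size of 100 million vectors is a hypothetical chosen for illustration, and the estimate covers raw vector storage only, ignoring any index metadata.

```python
def index_bytes(num_vectors, dims, bits_per_dim):
    """Raw storage for the vectors alone, ignoring index metadata."""
    return num_vectors * dims * bits_per_dim // 8

n, dims = 100_000_000, 768   # hypothetical corpus size
for name, bits in [("float32", 32), ("int8", 8), ("binary", 1)]:
    gib = index_bytes(n, dims, bits) / 2**30
    print(f"{name:8s} {gib:8.1f} GiB")
```

At this scale the float32 vectors need roughly 286 GiB while the binary codes need about 9 GiB, which is the difference between a multi-node deployment and a single machine's RAM.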

Training for binary retrieval

Models can be fine-tuned with binary quantization in the loop:

  • Straight-through estimator: backpropagate through sign() as if it were identity
  • Distillation: match binary dot products to float32 teacher scores
  • Matryoshka + binarization: truncate embeddings trained with Matryoshka Representation Learning (MRL) to a low dimension, then binarize
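
As a rough illustration of the straight-through estimator, the sketch below writes the forward and backward passes by hand in NumPy rather than relying on a deep-learning framework. The function names are hypothetical, and the optional `clip` argument (zeroing the gradient where |x| exceeds a threshold) is a commonly used variant assumed here, not something the description above requires.

```python
import numpy as np

def binarize_forward(x):
    """Forward pass: hard sign, mapping values to +1 or -1."""
    return np.where(x > 0, 1.0, -1.0)

def binarize_backward(x, grad_output, clip=None):
    """Straight-through backward pass.

    sign() has zero gradient almost everywhere, so the estimator passes the
    upstream gradient through unchanged, as if the forward pass were identity.
    With clip set, the gradient is zeroed where |x| > clip (assumed variant).
    """
    if clip is None:
        return grad_output.copy()
    return grad_output * (np.abs(x) <= clip)

x = np.array([-1.7, -0.3, 0.2, 2.5])
b = binarize_forward(x)
upstream = np.array([0.5, -1.0, 0.25, 4.0])   # dLoss/db from later layers
grad_x = binarize_backward(x, upstream)       # identical to upstream
grad_x_clipped = binarize_backward(x, upstream, clip=1.0)
```

In a framework with automatic differentiation the same effect is usually obtained by overriding the gradient of the sign operation rather than writing the backward pass explicitly.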

Variants and history

Binary codes for IR predate deep learning: SimHash produces bit fingerprints whose Hamming distance approximates cosine similarity, and MinHash is a related hashing scheme for set similarity. The neural revival is driven by the rapid growth in embedding models and corpus sizes: indexing billions of vectors as float32 is expensive, while binary codes cut storage by 32x and enable CPU-only retrieval. Cohere Embed v3 and Nomic Embed support binary embeddings. FAISS supports binary indexes (IndexBinaryFlat, IndexBinaryIVF). Combining binarization with Matryoshka Representation Learning (truncate to 512 or 256 dims, then binarize) is increasingly common.
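
The Matryoshka combination amounts to slicing before thresholding. The sketch below assumes the embeddings come from an MRL-trained model, so that the leading dimensions carry most of the information; with an ordinary model, truncating this way would likely hurt recall much more. The function name and the random stand-in data are illustrative only.

```python
import numpy as np

def truncate_and_binarize(embeddings, dims):
    """Keep the first `dims` dimensions, then threshold at zero and pack bits."""
    truncated = np.asarray(embeddings)[..., :dims]
    return np.packbits(truncated > 0, axis=-1)

rng = np.random.default_rng(0)
full = rng.standard_normal((4, 768)).astype(np.float32)  # stand-in for model output
codes = truncate_and_binarize(full, dims=256)            # shape (4, 32), dtype uint8
```

At 256 dimensions each code takes 32 bytes, a 96x reduction from the 3,072 bytes of a 768-dim float32 vector.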

When to use it

Use binary embeddings when:

  • Index must fit in RAM without GPU (on-premise, edge deployment)
  • Retrieval latency is critical and a CPU Hamming-distance scan is fast enough for the index size
  • A recall penalty in the 5–20% range is acceptable
  • First-stage retrieval followed by float32 re-scoring is feasible

Not suitable when recall@k is paramount and no re-scoring stage is planned.

See also