Binary Embeddings
What it is
Binary embeddings compress float32 or float16 embedding vectors to 1 bit per dimension by thresholding each value at zero (positive → 1, otherwise → 0). Similarity between binary vectors is measured by Hamming distance, the number of differing bits, which is computed as an XOR followed by a population count and maps onto the CPU-native POPCNT instruction. The result is 32x compression over float32 (16x over float16) and retrieval that can run entirely on CPU, without GPU or FAISS overhead. The trade-off is a recall penalty of roughly 5–20%, depending on how the embeddings were trained.
[illustrate: float32 vector → sign(x) → binary vector; Hamming distance via POPCNT; comparison to dot-product similarity]
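The comparison to dot-product similarity can be made concrete. If each bit is read as a +1/-1 sign instead of 1/0, the dot product of two d-dimensional sign vectors equals d minus twice their Hamming distance, so ranking by smallest Hamming distance gives the same order as ranking by largest sign-vector dot product. The following is a minimal numerical check of that identity; the random vectors and the names `bits_a`, `signs_a`, and so on are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 768
a = rng.standard_normal(d).astype(np.float32)
b = rng.standard_normal(d).astype(np.float32)

# 0/1 bits obtained by thresholding at zero
bits_a = a > 0
bits_b = b > 0

# The same information expressed as +1/-1 signs
signs_a = np.where(bits_a, 1, -1)
signs_b = np.where(bits_b, 1, -1)

hamming = int(np.count_nonzero(bits_a != bits_b))
sign_dot = int(signs_a @ signs_b)

# Agreeing positions contribute +1 to the dot product, differing ones -1,
# so the two quantities are tied together by d - 2 * hamming.
assert sign_dot == d - 2 * hamming
```

This covers only the relationship between the binarized vectors. How closely the sign-vector ranking tracks the original float dot-product ranking depends on the embedding model.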
How it works
Quantization
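import numpy as np

def to_binary(embedding):
    # float32 embedding: [-0.3, 0.8, -0.1, 0.6, ...]
    # Bits:              [   0,   1,    0,   1, ...]
    # Threshold at zero, then pack 8 bits per byte (768 dims -> 96 bytes).
    return np.packbits(np.asarray(embedding) > 0, axis=-1)

def hamming_distance(a, b):
    # XOR the packed bytes, then count the set bits (popcount).
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())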
Retrieval with binary index
- Pre-encode all documents → binary vectors
- Pack 8 bits per byte: 768-dim embedding → 96 bytes (vs. 3072 bytes float32)
- At query time: binarize query, Hamming search over index
- Optional: re-score top-k with full float embeddings (two-stage)
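The steps above can be sketched as a brute-force two-stage search in NumPy. This is a minimal sketch under stated assumptions, not a production index: the function names (`pack_bits`, `hamming_search`, `two_stage_search`), the byte-wise popcount lookup table, the `oversample` parameter, and the random test data are all hypothetical choices made for illustration.

```python
import numpy as np

# Popcount lookup table: number of set bits for every byte value 0..255.
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint16)

def pack_bits(embeddings):
    """Threshold at zero and pack 8 dimensions per byte along the last axis."""
    return np.packbits(np.asarray(embeddings) > 0, axis=-1)

def hamming_search(query_code, index_codes, k):
    """Return the k index rows closest to query_code in Hamming distance."""
    xor = np.bitwise_xor(index_codes, query_code)  # broadcasts over rows
    distances = POPCOUNT[xor].sum(axis=1)
    top = np.argsort(distances, kind="stable")[:k]
    return top, distances[top]

def two_stage_search(query, doc_floats, doc_codes, k=10, oversample=4):
    """Stage 1: Hamming search for k * oversample candidates.
    Stage 2: re-score only those candidates with float dot products."""
    candidates, _ = hamming_search(pack_bits(query), doc_codes, k * oversample)
    scores = doc_floats[candidates] @ query
    order = np.argsort(-scores)[:k]
    return candidates[order], scores[order]

# Illustrative data: 1000 random 768-dim "documents" and a noisy copy of one.
rng = np.random.default_rng(0)
docs = rng.standard_normal((1000, 768)).astype(np.float32)
codes = pack_bits(docs)  # shape (1000, 96), dtype uint8
query = docs[42] + 0.1 * rng.standard_normal(768).astype(np.float32)

ids, scores = two_stage_search(query, docs, codes, k=5)
```

Only the packed codes need to be scanned for every query. The float embeddings are touched just for the candidate rows, so in a real deployment they could live in slower storage than the binary index.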
Performance characteristics
float32 (768-dim): 3,072 bytes/embedding, baseline size, dot product, GPU optimal
int8 (768-dim): 768 bytes/embedding, 4x smaller, fast CPU
binary (768-dim): 96 bytes/embedding, 32x smaller, Hamming distance via POPCNT on CPU
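The per-embedding sizes above follow directly from dimensions times bits per dimension. The short calculation below reproduces them and scales them to a corpus; the helper name `bytes_per_embedding` and the one-million-vector corpus size are assumptions chosen for illustration, and the totals cover raw vector storage only, not any index overhead.

```python
def bytes_per_embedding(dims, bits_per_dim):
    """Storage for one embedding; the bit count is rounded up to whole bytes."""
    return (dims * bits_per_dim + 7) // 8

dims = 768
n_vectors = 1_000_000  # illustrative corpus size (an assumption)

for name, bits in [("float32", 32), ("int8", 8), ("binary", 1)]:
    per_vec = bytes_per_embedding(dims, bits)
    total_gb = per_vec * n_vectors / 1e9
    print(f"{name:8s} {per_vec:5d} bytes/embedding  {total_gb:.3f} GB per million vectors")
```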
Training for binary retrieval
Models can be fine-tuned with binary quantization in the loop:
- Straight-through estimator: apply sign() in the forward pass, but backpropagate through it as if it were the identity
- Distillation: match similarity scores computed on the binarized vectors to float32 teacher scores
- Matryoshka + binarization: truncate MRL-trained embeddings to a lower dimension, then binarize
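The straight-through estimator can be illustrated without an autograd framework. The gradient of sign() is zero almost everywhere, so the backward pass simply hands the incoming gradient to the pre-binarization values unchanged. The sketch below is a toy example under stated assumptions: the function names, the squared-error objective, the target pattern, and the learning rate are all hypothetical, and a real fine-tuning setup would apply the same idea inside a deep-learning framework with a retrieval loss.

```python
import numpy as np

def binarize_forward(x):
    """Forward pass: hard sign, mapping each value to +1.0 or -1.0."""
    return np.where(x > 0, 1.0, -1.0)

def binarize_backward(grad_output):
    """Backward pass (straight-through): treat sign() as the identity."""
    return grad_output

# Toy objective: pull the binarized code toward a fixed target sign pattern.
rng = np.random.default_rng(0)
x = rng.standard_normal(8)  # pre-binarization values being trained
target = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)

lr = 0.1
for _ in range(200):
    code = binarize_forward(x)
    grad_code = code - target              # gradient of 0.5 * ||code - target||^2
    grad_x = binarize_backward(grad_code)  # passed through unchanged
    x -= lr * grad_x

final_code = binarize_forward(x)
```

Dimensions whose sign already matches the target receive zero gradient and stay put. Mismatched dimensions are pushed across zero until their sign flips.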
Variants and history
Binary and hashed representations for IR predate deep learning (SimHash, MinHash). The neural revival is driven by the rapid growth in embedding models: indexing billions of vectors as float32 is expensive, while binary codes cut storage by 32x and enable CPU-only retrieval. Cohere Embed v3 and Nomic Embed support binary embeddings. FAISS supports binary indexes (IndexBinaryFlat, IndexBinaryIVF). Combining binarization with Matryoshka Representation Learning (truncate to 512 or 256 dimensions, then binarize) is increasingly common.
When to use it
Use binary embeddings when:
- The index must fit in RAM on machines without a GPU (on-premise, edge deployment)
- Retrieval latency is critical and a Hamming-distance scan meets the latency budget
- A small recall penalty (5–15%) is acceptable
- First-stage retrieval followed by float32 re-scoring is feasible
Binary embeddings are not suitable when recall@k is paramount and no re-scoring stage is planned.