[P] TurboQuant Pro: Open-source vector compression toolkit — 5-42x smaller embeddings with 0.97+ recall [R]
TL;DR: We built an open-source toolkit that compresses high-dimensional vectors (embeddings, KV cache, anything in pgvector/FAISS) by 5-42x. Cosine similarity stays at 0.97 or higher up to 16x compression and drops to 0.93 at 42x. Benchmarked 6 methods on 2.4M real embeddings. MIT licensed.
GitHub: https://github.com/ahb-sjsu/turboquant-pro
Install: pip install turboquant-pro
The Problem
Vector databases are eating RAM. If you're running RAG with BGE-M3 (1024-dim float32), each embedding is 4KB. At 1M vectors that's 4GB just for embeddings. At 10M you need 40GB. pgvector, FAISS, Pinecone — they all have this problem.
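To sanity-check the numbers above, here is the back-of-envelope arithmetic as a tiny script. The function name is made up for illustration, and it counts raw float32 storage only (no index or row overhead).

```python
def embedding_bytes(num_vectors: int, dim: int = 1024, bytes_per_value: int = 4) -> int:
    """Raw storage for num_vectors embeddings, ignoring any index overhead."""
    return num_vectors * dim * bytes_per_value

for n in (1_000_000, 10_000_000):
    # Decimal gigabytes (1 GB = 1e9 bytes), matching the rough figures in the post
    print(f"{n:>11,} vectors: {embedding_bytes(n) / 1e9:.1f} GB")
```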
What We Built
TurboQuant Pro implements and benchmarks 6 compression methods:
| Method | Ratio | Cosine Sim | Recall@10 | Complexity |
|---|---|---|---|---|
| Scalar int8 | 4x | 0.999 | 0.99 | Trivial |
| Matryoshka truncation | 4x | 0.97 | 0.96 | Trivial |
| TurboQuant 3-bit | 5.1x | 0.978 | 0.97 | Medium |
| pgvector bytea (TQ) | 10.5x | 0.978 | 0.95 | Medium |
| Matryoshka + int8 | 16x | 0.97 | 0.94 | Low |
| Matryoshka + TQ 3-bit | 42x | 0.93 | 0.90 | Medium |
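For anyone wanting to reproduce a Recall@10 column like the one in the table on their own data, here is one common way to measure it. This is a sketch, not necessarily the exact procedure used for the benchmark: `recall_at_k` is a hypothetical helper, it uses brute-force cosine search, and it assumes the compressed vectors have already been reconstructed to the same dimension as the queries.

```python
import numpy as np

def recall_at_k(original: np.ndarray, reconstructed: np.ndarray,
                queries: np.ndarray, k: int = 10) -> float:
    """Fraction of the true top-k neighbours (by cosine) still found after compression."""
    def top_k(db: np.ndarray) -> np.ndarray:
        db_n = db / np.linalg.norm(db, axis=1, keepdims=True)
        q_n = queries / np.linalg.norm(queries, axis=1, keepdims=True)
        sims = q_n @ db_n.T                       # (num_queries, num_db) cosine matrix
        return np.argsort(-sims, axis=1)[:, :k]   # indices of the k most similar rows

    truth = top_k(original)
    approx = top_k(reconstructed)
    hits = [len(set(t) & set(a)) for t, a in zip(truth, approx)]
    return float(np.mean(hits)) / k
```

Brute force is fine for a few thousand sampled queries; at millions of vectors you would likely sample the database or use the index you plan to deploy.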
The core algorithm is PolarQuant + QJL from Zandieh et al. (ICLR 2026): a random rotation spreads each normalized vector's energy evenly across its coordinates, so every coordinate is approximately Gaussian, and a Lloyd-Max scalar quantizer fitted to that Gaussian then compresses each coordinate to b bits. We added bit-packing, CUDA kernels, and a streaming KV cache manager on top.
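To make the rotate-then-quantize idea concrete, here is a minimal NumPy sketch. It is not the library's code: `lloyd_max_codebook` is a hypothetical helper that fits the codebook empirically with Lloyd iterations on the rotated data (rather than using a precomputed Gaussian table), the input is synthetic and deliberately non-Gaussian, and the QJL part of the algorithm is left out entirely.

```python
import numpy as np

def lloyd_max_codebook(samples: np.ndarray, bits: int = 3, iters: int = 30):
    """Fit a 2**bits-level scalar quantizer to 1-D samples with Lloyd iterations."""
    levels = 2 ** bits
    # Initialise centroids at evenly spaced quantiles of the data
    centroids = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        boundaries = (centroids[:-1] + centroids[1:]) / 2   # midpoints between centroids
        idx = np.digitize(samples, boundaries)
        for j in range(levels):
            members = samples[idx == j]
            if members.size:                                # leave empty cells unchanged
                centroids[j] = members.mean()
    return centroids, (centroids[:-1] + centroids[1:]) / 2

rng = np.random.default_rng(0)
dim = 256
x = rng.standard_normal((1000, dim)) ** 3                   # heavy-tailed toy input
x /= np.linalg.norm(x, axis=1, keepdims=True)               # unit-norm vectors

Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))        # random orthogonal matrix
rotated = (x @ Q) * np.sqrt(dim)                            # rescale to roughly unit variance

centroids, boundaries = lloyd_max_codebook(rotated.ravel(), bits=3)
indices = np.digitize(rotated, boundaries)                  # 3-bit codes, values 0..7

recon = (centroids[indices] / np.sqrt(dim)) @ Q.T           # dequantize, undo the rotation
cos = np.sum(x * recon, axis=1) / np.linalg.norm(recon, axis=1)
print("mean cosine similarity:", cos.mean())
```

Because Q is orthogonal, the rotation itself is lossless; all of the error comes from the 3-bit quantization step.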
What's Novel
- First open-source implementation of the Zandieh et al. TurboQuant algorithm
- Multi-method benchmarking on real data (2.4M embeddings from a cross-civilizational ethics corpus spanning 5,000 years — long story)
- Practical recommendations — we found that for most RAG use cases, Matryoshka truncation + scalar int8 (16x, zero training, 3 lines of code) beats fancy methods. TurboQuant's rotation trick only wins for KV cache where you need quality at high compression.
- pgvector integration — store compressed embeddings as bytea, search in compressed space
- Streaming KV cache with L1 (hot, uncompressed) / L2 (cold, compressed) tiering
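The L1/L2 tiering in the last bullet can be pictured with a toy two-tier store. Everything here is hypothetical (the class name, the eviction policy, the block IDs), and it swaps in plain per-block int8 quantization for the cold tier to stay short; the actual project presumably uses its TurboQuant codec there.

```python
from collections import OrderedDict
import numpy as np

class TieredCache:
    """Toy two-tier store: recent blocks stay float32 (L1), older ones become int8 (L2)."""

    def __init__(self, hot_capacity: int = 2):
        self.hot_capacity = hot_capacity
        self.hot = OrderedDict()   # block_id -> float32 array
        self.cold = {}             # block_id -> (int8 array, float scale)

    def put(self, block_id, block: np.ndarray) -> None:
        self.hot[block_id] = block.astype(np.float32)
        self.hot.move_to_end(block_id)
        while len(self.hot) > self.hot_capacity:
            old_id, old_block = self.hot.popitem(last=False)      # evict the oldest block
            scale = float(np.abs(old_block).max()) / 127.0 or 1.0  # avoid a zero scale
            self.cold[old_id] = (np.round(old_block / scale).astype(np.int8), scale)

    def get(self, block_id) -> np.ndarray:
        if block_id in self.hot:
            return self.hot[block_id]
        q, scale = self.cold[block_id]
        return q.astype(np.float32) * scale                        # lossy reconstruction
```

The point of the split is that recently written blocks are read back exactly, while older ones trade a small reconstruction error for a 4x smaller footprint (more with a stronger codec).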
Origin Story
This started as a beam search optimization in a symbolic AI system (Theory Radar — formula search engine). The beam candidates were high-dimensional vectors that we compressed to fit wider beams in GPU memory. Then we realized the same trick works for LLM KV cache, then for RAG embeddings, then for... everything.
Adapted from a production system running on 2x Quadro GV100 32GB. Benchmarked against a real pgvector database with 2.4M vectors. This isn't synthetic data — it's actual BGE-M3 embeddings from texts spanning Ancient Greek philosophy to Buddhist suttas to Reddit advice columns.
Key Finding
Simple beats clever for most use cases. Scalar int8 gives you 4x compression at 0.999 cosine similarity with literally 3 lines of NumPy. Matryoshka truncation (just slicing the vector) gives another free 4x if your embedding model supports it (BGE-M3 does). Combined that's 16x with zero moving parts, zero training, zero codebooks that can go stale.
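For reference, the simple pipeline described above looks roughly like this in NumPy. The function names and the choice of 256 kept dimensions are assumptions for illustration (1024 float32 values down to 256 int8 values gives the 16x figure, ignoring the 4-byte scale stored per vector), and per-vector symmetric scaling is just one reasonable way to do the int8 step.

```python
import numpy as np

def compress_simple(x: np.ndarray, keep_dims: int = 256):
    """Matryoshka truncation followed by per-vector symmetric int8 quantization."""
    truncated = x[:, :keep_dims]                                   # keep the leading dimensions
    truncated = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)
    scale = np.abs(truncated).max(axis=1, keepdims=True) / 127.0   # one scale per vector
    q = np.round(truncated / scale).astype(np.int8)
    return q, scale.astype(np.float32)

def decompress_simple(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Map int8 codes back to approximate float32 values."""
    return q.astype(np.float32) * scale
```

Truncation only preserves quality when the embedding model was trained Matryoshka-style, so this should be checked on real embeddings rather than random data.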
TurboQuant's rotation trick is worth the complexity only when you need the last bit of quality at high compression — specifically for KV cache in long-context inference where the quality/compression tradeoff directly affects output quality.
Technical Details
The PolarQuant step:
```python
import numpy as np

rng = np.random.default_rng(0)
# x: (n, dim) array of unit-norm vectors; codebook_boundaries: precomputed Lloyd-Max boundaries

# Random rotation: for unit-norm x, the result is spread uniformly over the unit hypersphere
Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
rotated = x @ Q  # Now each coordinate ~ N(0, 1/dim), i.e. std 1/sqrt(dim)

# Lloyd-Max scalar quantizer (optimal for Gaussian)
indices = np.digitize(rotated, codebook_boundaries)  # b-bit per coordinate

# Bit-pack: 8 x 3-bit values → 3 bytes
packed = pack_3bit(indices)  # 5.12x compression
```
Decompression:
```python
reconstructed = codebook_centroids[unpack_3bit(packed)]
original_approx = reconstructed @ Q.T  # Inverse rotation

# cosine_similarity(original, original_approx) ≈ 0.978
```
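The snippets above call `pack_3bit` and `unpack_3bit` without showing them. Here is one way such helpers could be written with NumPy's `packbits`/`unpackbits`; this is an illustrative sketch that reuses the names from the post, not the library's actual implementation, and the bit order (most significant bit first) is an arbitrary choice.

```python
import numpy as np

def pack_3bit(indices: np.ndarray) -> np.ndarray:
    """Pack values in 0..7 into bytes: every 8 values occupy 3 bytes."""
    idx = np.asarray(indices, dtype=np.uint8).ravel()
    if idx.size % 8:
        raise ValueError("number of values must be a multiple of 8")
    # Split each value into its 3 bits (most significant first), then pack 8 bits per byte
    bits = (idx[:, None] >> np.array([2, 1, 0], dtype=np.uint8)) & 1
    return np.packbits(bits.ravel())

def unpack_3bit(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_3bit: recover the flat array of 3-bit values."""
    bits = np.unpackbits(packed).reshape(-1, 3)
    return (bits[:, 0] << 2) | (bits[:, 1] << 1) | bits[:, 2]
```

For a 1024-dimensional vector this yields 384 bytes of payload. `unpack_3bit` returns a flat array, so batched inputs need a reshape back to `(n, dim)` before the centroid lookup.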
CuPy CUDA kernels for GPU: ~25 GB/s throughput on Volta.
Usage
```python
from turboquant_pro import TurboQuantKV

tq = TurboQuantKV(head_dim=1024, bits=3)
compressed = tq.compress(embeddings, packed=True)  # 5.12x smaller
recovered = tq.decompress(compressed)  # 0.978 cosine sim
```
For pgvector:
```python
from turboquant_pro.pgvector import TurboQuantPGVector

tqpg = TurboQuantPGVector(dim=1024, bits=3)
bytea_data = tqpg.to_pgbytea(embedding)  # 4096 bytes → 388 bytes
```
What's Next
- Autotune CLI: `turboquant-pro autotune --source postgres://... --min-recall 0.95`
- Native pgvector extension (C, not Python wrapper)
- FAISS integration
- vLLM KV cache plugin
- Proper paper (arXiv draft is in the repo)
Links
- Code: https://github.com/ahb-sjsu/turboquant-pro
- Install: `pip install turboquant-pro`
- Paper: Zandieh et al., "Sub-linear Memory Inference via PolarQuant and QJL", ICLR 2026
- License: MIT
Feedback welcome. We're particularly interested in benchmarks on other embedding models (OpenAI ada-002, Cohere, etc.) and at larger scale (100M+ vectors).
Built as part of the Atlas AI cognitive architecture project. The 2.4M ethics embeddings come from texts spanning Ancient Greek (Homer, Plato, Aristotle), Hebrew (Talmud, Mishnah), Buddhist (Pali Canon), Sanskrit (Vedas, Upanishads), Old Norse (Eddas), and modern advice columns. Because why not.