Practical Guide to Scaling Vector Search with turbovec
The Problem: Memory and Cost in Large‑Scale Vector Search
Storing high‑dimensional embeddings in raw float32 quickly becomes prohibitive.
- A 10‑million‑document corpus with 1536‑dim vectors needs ≈61 GB of RAM (10⁷ × 1536 × 4 bytes).
- Teams running local or on‑premise RAG pipelines hit memory limits and must either pay for hardware upgrades or down‑sample the corpus, which hurts retrieval quality.
Why Float32 Embeddings Are Expensive
- Each dimension occupies 4 bytes; memory grows linearly with dim × count.
- No compression is applied, so the index must reside entirely in RAM for low‑latency search.
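The linear growth is easy to check by hand. The helper below is a minimal sketch (the function name is illustrative, not part of any library) that multiplies vector count, dimension, and 4 bytes per float32:

```python
def float32_index_bytes(num_vectors: int, dim: int) -> int:
    """Raw size of an uncompressed float32 embedding matrix, in bytes."""
    bytes_per_float32 = 4
    return num_vectors * dim * bytes_per_float32


size = float32_index_bytes(10_000_000, 1536)
print(f"{size / 1e9:.2f} GB")  # 61.44 GB
```

This counts only the raw vectors; any index metadata comes on top.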
Limitations of Traditional Quantization (FAISS PQ)
- Requires a codebook training step (k‑means on a sample) before indexing.
- If the corpus drifts or expands, the codebook must be recomputed and the index rebuilt.
- Training adds latency and complicates incremental updates.
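To see why a trained codebook is tied to its training data, here is a toy NumPy sketch of the k‑means step for a single sub‑space. It is not FAISS code; the function names, sizes, and the synthetic "drift" (a constant shift) are assumptions made for illustration only.

```python
import numpy as np


def train_codebook(samples, num_centroids=16, iterations=10, seed=0):
    """Plain Lloyd's k-means: the codebook is fitted to the training sample."""
    rng = np.random.default_rng(seed)
    picks = rng.choice(len(samples), num_centroids, replace=False)
    centroids = samples[picks].copy()
    for _ in range(iterations):
        # Assign every sample to its nearest centroid.
        dists = ((samples[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned samples.
        for c in range(num_centroids):
            members = samples[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    return centroids


def quantization_error(data, centroids):
    """Mean squared distance from each vector to its nearest centroid."""
    dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(dists.min(axis=1).mean())


rng = np.random.default_rng(42)
original = rng.standard_normal((2000, 8)).astype(np.float32)
# Synthetic drift: same shape, shifted mean.
drifted = (rng.standard_normal((2000, 8)) + 3.0).astype(np.float32)

codebook = train_codebook(original)
err_original = quantization_error(original, codebook)
err_drifted = quantization_error(drifted, codebook)
print(err_original, err_drifted)  # the drifted data is quantized far less accurately
```

The codebook only fits the data it was trained on, which is why a drifting corpus forces retraining and a rebuild.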
Introducing turbovec: A Data‑Oblivious Quantization Solution
turbovec implements Google Research’s TurboQuant algorithm, a quantization method that needs zero training and works on any data distribution. The result is a compact index that can be built instantly and searched efficiently on both ARM and x86 CPUs.
How TurboQuant Works
- Normalize each vector to unit length; store the norm as a separate float.
- Apply a shared random rotation so that each coordinate follows a known (approximately Gaussian) distribution.
- Perform Lloyd‑Max scalar quantization using pre‑computed optimal bucket boundaries—no data passes needed.
- Bit‑pack the quantized coordinates; a 1536‑dim vector drops from 6 144 bytes (FP32) to 384 bytes at 2‑bit (16× compression).
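The four steps above can be sketched in a few lines of NumPy. This is an illustrative approximation, not turbovec's implementation: the rotation is built with a QR decomposition, the 2‑bit thresholds and levels are the standard textbook Lloyd‑Max values for a unit‑variance Gaussian, and all function names are hypothetical.

```python
import numpy as np

# Lloyd-Max quantizer for a unit-variance Gaussian at 2 bits (4 levels).
# Textbook values, assumed here; not constants read from turbovec.
THRESHOLDS = np.array([-0.9816, 0.0, 0.9816])
LEVELS = np.array([-1.5104, -0.4528, 0.4528, 1.5104])


def random_rotation(dim, seed=0):
    """Shared random orthogonal matrix (QR of a Gaussian matrix)."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q


def encode(vector, rotation):
    """Return (norm, packed 2-bit codes) for one vector; dim must divide by 4."""
    dim = vector.shape[0]
    norm = float(np.linalg.norm(vector))
    unit = vector / norm
    # After rotation each coordinate has variance ~1/dim; rescale to ~N(0, 1).
    z = (rotation @ unit) * np.sqrt(dim)
    codes = np.digitize(z, THRESHOLDS).astype(np.uint8)  # values in 0..3
    c = codes.reshape(-1, 4)  # four 2-bit codes per byte
    packed = c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)
    return norm, packed


def decode(norm, packed, rotation):
    """Unpack the codes, map them to levels, undo the scaling and rotation."""
    dim = rotation.shape[0]
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = ((packed[:, None] >> shifts) & 3).reshape(-1)
    z_hat = LEVELS[codes] / np.sqrt(dim)
    return norm * (rotation.T @ z_hat)


dim = 1536
rotation = random_rotation(dim)
x = np.random.default_rng(1).standard_normal(dim).astype(np.float32)
norm, packed = encode(x, rotation)
x_hat = decode(norm, packed, rotation)
cosine = float(x @ x_hat / (np.linalg.norm(x) * np.linalg.norm(x_hat)))
print(packed.nbytes, round(cosine, 3))  # 384 bytes of codes, plus the stored norm
```

Note that nothing in `encode` looks at other vectors: the thresholds are fixed in advance, which is what makes the scheme data‑oblivious. A production implementation would likely use a faster structured rotation and score queries directly on the packed codes rather than decoding.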
Benefits Over Existing Approaches
- No codebook training → instant indexing, seamless handling of growing corpora.
- Deterministic compression → predictable memory footprint.
- Search speed → turbovec beats FAISS IndexPQFastScan by 12‑20 % on ARM and is competitive on x86.
- Near‑optimal distortion → within ~2.7× of the Shannon lower bound.
- Fully local → no external service, no data egress, ideal for air‑gapped or regulated environments.
Getting Started with turbovec
Installation
```bash
# Python
pip install turbovec

# Optional framework extras (quoted so shells such as zsh do not expand the brackets)
pip install "turbovec[langchain]"
pip install "turbovec[llama-index]"
pip install "turbovec[haystack]"

# Rust
cargo add turbovec
```
Basic Usage (TurboQuantIndex)
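```python
import numpy as np
from turbovec import TurboQuantIndex

# 1536-dim vectors, 4-bit quantization
index = TurboQuantIndex(dim=1536, bit_width=4)

# vectors: np.ndarray of shape [n, 1536], dtype=float32
# (random placeholders here; use your real embeddings)
rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 1536)).astype(np.float32)
more_vectors = rng.standard_normal((1000, 1536)).astype(np.float32)
index.add(vectors)       # incremental adds are allowed
index.add(more_vectors)

# Search
query = rng.standard_normal(1536).astype(np.float32)
scores, indices = index.search(query, k=10)  # query: float32[1536]
```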
Managing Stable IDs (IdMapIndex)
When you need to delete or update vectors by an external identifier:
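```python
import numpy as np
from turbovec import IdMapIndex

index = IdMapIndex(dim=1536, bit_width=4)

# Map vectors to your own uint64 IDs
# (random placeholder vectors, one row per ID)
rng = np.random.default_rng(0)
vectors = rng.standard_normal((3, 1536)).astype(np.float32)
ids = np.array([1001, 1002, 1003], dtype=np.uint64)
index.add_with_ids(vectors, ids)

# Search returns the external IDs, not positional offsets
query = rng.standard_normal(1536).astype(np.float32)
scores, returned_ids = index.search(query, k=3)  # only three vectors are indexed here

# O(1) delete by external ID
index.remove(1002)
```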
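Constant‑time deletion by external ID is commonly achieved with a hash map plus a swap‑with‑last removal. The guide does not say how turbovec implements it, so the class below is a hypothetical toy that only illustrates the general technique:

```python
class SwapRemoveIdMap:
    """Toy external-ID -> slot map with O(1) average-case deletion."""

    def __init__(self):
        self.slot_of = {}  # external ID -> slot position
        self.id_at = []    # slot position -> external ID

    def add(self, external_id):
        if external_id in self.slot_of:
            raise KeyError(f"duplicate id {external_id}")
        self.slot_of[external_id] = len(self.id_at)
        self.id_at.append(external_id)

    def remove(self, external_id):
        slot = self.slot_of.pop(external_id)
        last_id = self.id_at.pop()
        if last_id != external_id:
            # Move the last entry into the freed slot
            # (in a real index its packed vector would move with it).
            self.id_at[slot] = last_id
            self.slot_of[last_id] = slot


id_map = SwapRemoveIdMap()
for external_id in (1001, 1002, 1003):
    id_map.add(external_id)
id_map.remove(1002)
print(id_map.id_at)  # [1001, 1003]
```

The trade‑off of this approach is that slot positions are not stable across deletes, which is exactly why callers work with external IDs instead of offsets.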
Persistence (Save & Load)
```python
# TurboQuantIndex → .tq file (here `index` is a TurboQuantIndex)
index.write("my_index.tq")
loaded = TurboQuantIndex.load("my_index.tq")

# IdMapIndex → .tvim file (here `index` is an IdMapIndex)
index.write("my_index.tvim")
loaded = IdMapIndex.load("my_index.tvim")
```
Framework Integrations
turbovec plugs directly into popular RAG stacks as a drop‑in vector store.
- LangChain – `pip install "turbovec[langchain]"`
- LlamaIndex – `pip install "turbovec[llama-index]"`
- Haystack – `pip install "turbovec[haystack]"`
Each extra registers turbovec as the underlying VectorStore implementation, allowing you to keep the same API while gaining the compression and speed benefits.
Performance and Accuracy Expectations
| Metric | Typical Result (100 K vectors, 1 000 queries, k=64) |
|---|---|
| Compression | 2‑bit → 16× (6 144 B → 384 B per vector) |
| Recall@1 | OpenAI embeddings (d=1536/3072): within 0‑1 pt of FAISS IndexPQ; GloVe (d=200): 3‑6 pt lower at R@1, catches up by k≈16‑32 |
| Search speed (ARM) | 12‑20 % faster than FAISS IndexPQFastScan across all configs |
| Search speed (x86) | 1‑6 % ahead on 4‑bit; within ~1 % on 2‑bit (two edge cases slightly behind FAISS due to short inner loops) |
These numbers show that you can cut memory by 8–16× while keeping recall close to FAISS PQ on high‑dimensional embeddings and matching or exceeding its query throughput.
Best Practices for Production RAG Pipelines
Choosing Bit Width
- 2‑bit – maximal compression (16×); use when memory is the primary constraint and slight recall loss is acceptable.
- 4‑bit – better recall (especially on low‑dim embeddings) with still 8× compression; a good default for most workloads.
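A quick estimate helps when picking a bit width. The sketch below assumes the per‑vector norm is stored as a 4‑byte float and ignores any file header or ID‑map overhead; the function name is illustrative.

```python
def quantized_index_bytes(num_vectors, dim, bit_width, norm_bytes=4):
    """Approximate index size: packed codes plus one stored norm per vector."""
    code_bytes = dim * bit_width // 8
    return num_vectors * (code_bytes + norm_bytes)


for bits in (2, 4):
    per_vector = 1536 * bits // 8
    total = quantized_index_bytes(10_000_000, 1536, bits)
    print(bits, per_vector, f"{total / 1e9:.2f} GB")
# 2 384 3.88 GB
# 4 768 7.72 GB
```

For 10 million 1536‑dim vectors, that is roughly 4 GB at 2‑bit and 8 GB at 4‑bit, against about 61 GB in raw float32.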
Handling Corpus Growth
- Because turbovec needs no retraining, you can call `add()` continuously as new documents arrive.
- Monitor RAM usage; if the index approaches your memory limit, consider lowering `bit_width` (for example, from 4 to 2) or sharding the index across multiple turbovec instances.
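If you shard, each shard returns its own top‑k and the results need to be merged. The helper below is a minimal sketch that assumes higher scores are better and that IDs are unique across shards (for example, external IDs from `IdMapIndex`); the function name and sample values are made up for illustration.

```python
import numpy as np


def merge_top_k(shard_results, k):
    """Merge per-shard (scores, ids) pairs into a global top-k, best score first."""
    scores = np.concatenate([s for s, _ in shard_results])
    ids = np.concatenate([i for _, i in shard_results])
    order = np.argsort(-scores, kind="stable")[:k]
    return scores[order], ids[order]


# Stand-ins for what two shards might return for the same query.
shard_a = (np.array([0.91, 0.80, 0.42], dtype=np.float32),
           np.array([7, 3, 9], dtype=np.uint64))
shard_b = (np.array([0.95, 0.60, 0.10], dtype=np.float32),
           np.array([21, 25, 20], dtype=np.uint64))

top_scores, top_ids = merge_top_k([shard_a, shard_b], k=3)
print(top_ids.tolist())  # [21, 7, 3]
```

Each shard must be queried with the same `k` as the final result, since all of the global top‑k could come from one shard.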
Air‑Gapped Deployments
- All operations are local; no telemetry or external calls.
- Pair turbovec with an open‑source embedding model (e.g., Sentence‑Transformers, BGE) that runs on‑premise for a fully private RAG stack.
Monitoring & Tuning
- Log index size after each bulk add to verify compression ratio.
- Periodically run a recall benchmark on a held‑out set to ensure the chosen `bit_width` still meets your quality SLA.
- If latency spikes, check whether the index is being searched single‑threaded; enable multi‑threaded search by providing multiple query vectors or using the Rust API’s thread‑pool options.
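A recall check can be a few lines of NumPy. The sketch below measures how often the true nearest neighbour (from brute‑force inner‑product search) appears in an approximate top‑k; benchmark definitions vary, so treat this as one reasonable choice. The noisy copy of the corpus is only a stand‑in for a lossy index; in practice `approx_ids` would come from your index's `search` call.

```python
import numpy as np


def exact_top_k(queries, corpus, k):
    """Brute-force inner-product search, used as ground truth."""
    scores = queries @ corpus.T
    return np.argsort(-scores, axis=1)[:, :k]


def recall_1_at_k(approx_ids, exact_nearest, k):
    """Fraction of queries whose true nearest neighbour is in the approximate top-k."""
    found = [exact_nearest[q] in approx_ids[q, :k] for q in range(len(exact_nearest))]
    return float(np.mean(found))


rng = np.random.default_rng(0)
corpus = rng.standard_normal((500, 64)).astype(np.float32)
queries = rng.standard_normal((20, 64)).astype(np.float32)

exact_nearest = exact_top_k(queries, corpus, k=1)[:, 0]

# Stand-in for a lossy index: search a noisy copy of the corpus.
noisy = corpus + 0.3 * rng.standard_normal(corpus.shape).astype(np.float32)
approx_ids = exact_top_k(queries, noisy, k=10)

recall = recall_1_at_k(approx_ids, exact_nearest, k=10)
print(recall)
```

Running this on a fixed held‑out query set after each large ingest gives a simple regression signal for retrieval quality.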
Further Resources
- GitHub Repository – https://github.com/RyanCodrai/turbovec (source, issues, releases)
- TurboQuant Paper – https://arxiv.org/abs/2504.19874 (details of the data‑oblivious quantizer)
- Documentation – see the `docs/` folder in the repo for API reference, integration examples, and performance tuning tips.
By adopting turbovec, teams can shrink their float32 vector indexes by roughly 8× (4‑bit) to 16× (2‑bit), eliminate codebook training and retraining, and run fast similarity search on the hardware they already have. This makes scalable, production‑grade RAG feasible even in resource‑constrained or air‑gapped environments.