Boost Python Vector Search with Turbovec TurboQuant Index

Practical Guide to Scaling Vector Search with turbovec


The Problem: Memory and Cost in Large‑Scale Vector Search

Storing high‑dimensional embeddings in raw float32 quickly becomes prohibitive.

  • A 10 million‑document corpus with 1536‑dim vectors needs ≈61 GB of RAM (10 M × 1536 × 4 bytes); even at 768 dimensions it needs ≈31 GB.
  • Teams running local or on‑premise RAG pipelines hit memory limits and must either pay for hardware upgrades or resort to down‑sampling that hurts retrieval quality.

Why Float32 Embeddings Are Expensive

  • Each dimension occupies 4 bytes; memory grows linearly with dim × count.
  • No compression is applied, so the index must reside entirely in RAM for low‑latency search.
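
The linear growth is easy to check with a few lines of arithmetic. The helper below is a hypothetical illustration (the name float32_index_bytes does not come from any library) and counts only the raw vector data, ignoring any index overhead.

```python
def float32_index_bytes(num_vectors: int, dim: int) -> int:
    """Raw storage for uncompressed float32 embeddings (4 bytes per dimension)."""
    return num_vectors * dim * 4


# Compare a few common embedding sizes for a 10 million-vector corpus.
for dim in (768, 1536, 3072):
    gb = float32_index_bytes(10_000_000, dim) / 1e9
    print(f"10M vectors x {dim} dims: {gb:.1f} GB")
```

Doubling the dimension doubles the footprint, which is why larger embedding models push local deployments past their RAM limits so quickly.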

Limitations of Traditional Quantization (FAISS PQ)

  • Requires a codebook training step (k‑means on a sample) before indexing.
  • If the corpus drifts or expands, the codebook must be recomputed and the index rebuilt.
  • Training adds latency and complicates incremental updates.

Introducing turbovec: A Data‑Oblivious Quantization Solution

turbovec implements Google Research’s TurboQuant algorithm, a quantization method that needs zero training and works on any data distribution. The result is a compact index that can be built instantly and searched efficiently on both ARM and x86 CPUs.

How TurboQuant Works

  1. Normalize each vector to unit length; store the norm as a separate float.
  2. Apply a shared random rotation so that each coordinate follows a known (approximately Gaussian) distribution.
  3. Perform Lloyd‑Max scalar quantization using pre‑computed optimal bucket boundaries—no data passes needed.
  4. Bit‑pack the quantized coordinates; a 1536‑dim vector drops from 6 144 bytes (FP32) to 384 bytes at 2‑bit (16× compression).
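
The four steps above can be sketched in NumPy. This is a simplified illustration, not turbovec's implementation: it assumes a Gaussian approximation of the rotated coordinates, uses the classic 4‑level Lloyd‑Max constants for a standard normal source, and all function names (random_rotation, encode, decode) are hypothetical.

```python
import numpy as np

# Classic 4-level (2-bit) Lloyd-Max quantizer for a standard normal source.
# Assumption: rotated coordinates are treated as approximately Gaussian.
BOUNDARIES = np.array([-0.9816, 0.0, 0.9816])
LEVELS = np.array([-1.510, -0.4528, 0.4528, 1.510])


def random_rotation(dim: int, seed: int = 0) -> np.ndarray:
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))  # column sign fix; q stays orthogonal


def encode(vectors: np.ndarray, rotation: np.ndarray):
    """Normalize, rotate, quantize to 2 bits, then pack 4 codes per byte."""
    dim = vectors.shape[1]  # must be a multiple of 4 for this packing
    norms = np.linalg.norm(vectors, axis=1)
    unit = vectors / norms[:, None]
    # Scale by sqrt(dim) so each rotated coordinate has roughly unit variance.
    rotated = (unit @ rotation.T) * np.sqrt(dim)
    codes = np.searchsorted(BOUNDARIES, rotated).astype(np.uint8)  # values 0..3
    c = codes.reshape(len(codes), dim // 4, 4)
    packed = c[..., 0] | (c[..., 1] << 2) | (c[..., 2] << 4) | (c[..., 3] << 6)
    return packed, norms.astype(np.float32)


def decode(packed: np.ndarray, norms: np.ndarray, rotation: np.ndarray) -> np.ndarray:
    """Approximate inverse of encode(): unpack, dequantize, rotate back, rescale."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = ((packed[..., None] >> shifts) & 3).reshape(len(packed), -1)
    rotated = LEVELS[codes] / np.sqrt(codes.shape[1])
    return (rotated @ rotation) * norms[:, None]


rng = np.random.default_rng(1)
x = rng.standard_normal((100, 128)).astype(np.float32)
rot = random_rotation(128)
packed, norms = encode(x, rot)
x_hat = decode(packed, norms, rot)
print(packed.shape, packed.dtype)  # 32 bytes per 128-dim vector
```

Because the boundaries are fixed constants rather than learned from the data, encode() never needs a training pass; new vectors can be quantized one at a time with the same rotation.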

Benefits Over Existing Approaches

  • No codebook training → instant indexing, seamless handling of growing corpora.
  • Deterministic compression → predictable memory footprint.
  • Search speed → turbovec beats FAISS IndexPQFastScan by 12‑20 % on ARM and is competitive on x86.
  • Near‑optimal distortion → within ~2.7× of the Shannon lower bound.
  • Fully local → no external service, no data egress, ideal for air‑gapped or regulated environments.

Getting Started with turbovec

Installation

bash

# Python
pip install turbovec

# Optional framework extras (quoted so shells such as zsh do not expand the brackets)
pip install "turbovec[langchain]"
pip install "turbovec[llama-index]"
pip install "turbovec[haystack]"

# Rust
cargo add turbovec

Basic Usage (TurboQuantIndex)

python
from turbovec import TurboQuantIndex
import numpy as np

# 1536-dim vectors, 4-bit quantization
index = TurboQuantIndex(dim=1536, bit_width=4)

# vectors: np.ndarray of shape [n, 1536], dtype=float32
index.add(vectors)       # incremental adds are allowed
index.add(more_vectors)

# Search
scores, indices = index.search(query, k=10)  # query: float32[1536]

Managing Stable IDs (IdMapIndex)

When you need to delete or update vectors by an external identifier:

python
from turbovec import IdMapIndex
import numpy as np

index = IdMapIndex(dim=1536, bit_width=4)

# Map vectors to your own uint64 IDs (one ID per row of `vectors`)
ids = np.array([1001, 1002, 1003], dtype=np.uint64)
index.add_with_ids(vectors, ids)

# Search returns the external IDs, not positional offsets
scores, returned_ids = index.search(query, k=10)

# O(1) delete by external ID
index.remove(1002)

Persistence (Save & Load)

python

# TurboQuantIndex → .tq file
index.write("my_index.tq")
loaded = TurboQuantIndex.load("my_index.tq")

# IdMapIndex → .tvim file
index.write("my_index.tvim")
loaded = IdMapIndex.load("my_index.tvim")


Framework Integrations

turbovec plugs directly into popular RAG stacks as a drop‑in vector store.

  • LangChain: pip install "turbovec[langchain]"
  • LlamaIndex: pip install "turbovec[llama-index]"
  • Haystack: pip install "turbovec[haystack]"

Each extra registers turbovec as the underlying VectorStore implementation, allowing you to keep the same API while gaining the compression and speed benefits.


Performance and Accuracy Expectations

| Metric | Typical result (100 K vectors, 1 000 queries, k=64) |
|---|---|
| Compression | 2‑bit → 16× (6 144 B → 384 B per vector) |
| Recall@1 | OpenAI embeddings (d=1536/3072): within 0‑1 pt of FAISS IndexPQ; GloVe (d=200): 3‑6 pt lower at R@1, catches up by k≈16‑32 |
| Search speed (ARM) | 12‑20 % faster than FAISS IndexPQFastScan across all configs |
| Search speed (x86) | 1‑6 % ahead on 4‑bit; within ~1 % on 2‑bit (two edge cases slightly behind FAISS due to short inner loops) |

These numbers show that you can cut memory by an order of magnitude while maintaining retrieval quality and gaining or matching query throughput.


Best Practices for Production RAG Pipelines

Choosing Bit Width

  • 2‑bit – maximal compression (16×); use when memory is the primary constraint and slight recall loss is acceptable.
  • 4‑bit – better recall (especially on low‑dim embeddings) with still 8× compression; a good default for most workloads.
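
One way to make this choice concrete is to estimate the index size at each bit width and keep the widest setting that fits the RAM budget. The sketch below is a rough estimate under stated assumptions: it counts the packed codes plus one float32 norm per vector and ignores any other overhead; both function names are hypothetical.

```python
def quantized_index_bytes(num_vectors: int, dim: int, bit_width: int) -> int:
    """Packed codes plus one float32 norm per vector; ignores other overhead."""
    code_bytes = dim * bit_width // 8
    return num_vectors * (code_bytes + 4)


def pick_bit_width(num_vectors: int, dim: int, ram_budget_bytes: int):
    """Return the widest of (4, 2) bits that fits the budget, or None."""
    for bit_width in (4, 2):
        if quantized_index_bytes(num_vectors, dim, bit_width) <= ram_budget_bytes:
            return bit_width
    return None


# 10 million 1536-dim vectors against an 8 GB and a 4 GB budget.
print(pick_bit_width(10_000_000, 1536, 8 * 10**9))
print(pick_bit_width(10_000_000, 1536, 4 * 10**9))
```

A result of None means neither setting fits on one machine, which is the point at which sharding becomes the remaining option.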

Handling Corpus Growth

  • Because turbovec needs no retraining, you can call add() continuously as new documents arrive.
  • Monitor RAM usage; if the index approaches your memory limit, consider lowering bit_width (for example from 4‑bit to 2‑bit) or sharding the index across multiple turbovec instances.
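
If you shard, each query has to be sent to every shard and the per‑shard results merged into one global top‑k. The helper below is a minimal sketch of that merge step, independent of turbovec itself. It assumes that higher scores mean more similar vectors and that every shard returns globally unique IDs (for example, external IDs from an IdMapIndex); merge_topk is a hypothetical name.

```python
import numpy as np


def merge_topk(shard_results, k):
    """Merge per-shard (scores, ids) pairs into a global top-k.

    Assumes higher scores mean more similar and ids are already global.
    """
    scores = np.concatenate([s for s, _ in shard_results])
    ids = np.concatenate([i for _, i in shard_results])
    order = np.argsort(-scores, kind="stable")[:k]
    return scores[order], ids[order]


# Two shards, each returning its own local top-2 for the same query.
shard_a = (np.array([0.91, 0.40]), np.array([7, 3]))
shard_b = (np.array([0.88, 0.95]), np.array([12, 15]))
top_scores, top_ids = merge_topk([shard_a, shard_b], k=3)
print(top_ids)
```

Each shard should be asked for at least k results so that the merged list cannot miss a true top‑k neighbour that happens to live in a single shard.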

Air‑Gapped Deployments

  • All operations are local; no telemetry or external calls.
  • Pair turbovec with an open‑source embedding model (e.g., Sentence‑Transformers, BGE) that runs on‑premise for a fully private RAG stack.

Monitoring & Tuning

  • Log index size after each bulk add to verify compression ratio.
  • Periodically run a recall benchmark on a held‑out set to ensure the chosen bit_width still meets your quality SLA.
  • If latency spikes, check whether the index is being searched single‑threaded; enable multi‑threaded search by providing multiple query vectors or using the Rust API’s thread‑pool options.
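
A recall benchmark of this kind needs only an exact baseline and a set‑overlap metric. The sketch below computes ground truth by brute force with NumPy and measures recall@k against it. It assumes inner‑product similarity, and the function names are hypothetical; in practice approx_ids would be the IDs returned by the quantized index for the same queries.

```python
import numpy as np


def exact_topk(corpus, queries, k):
    """Brute-force inner-product top-k, used as ground truth."""
    scores = queries @ corpus.T
    return np.argsort(-scores, axis=1)[:, :k]


def recall_at_k(approx_ids, exact_ids):
    """Mean fraction of exact neighbours found in the approximate results."""
    pairs = zip(approx_ids.tolist(), exact_ids.tolist())
    hits = [len(set(a) & set(e)) for a, e in pairs]
    return sum(hits) / (len(hits) * exact_ids.shape[1])


rng = np.random.default_rng(0)
corpus = rng.standard_normal((1000, 64)).astype(np.float32)
queries = rng.standard_normal((20, 64)).astype(np.float32)
truth = exact_topk(corpus, queries, k=10)

# Sanity check: comparing the ground truth with itself gives perfect recall.
print(recall_at_k(truth, truth))
```

Running the brute‑force baseline on a held‑out sample of a few thousand vectors keeps the check cheap enough to schedule regularly.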

Further Resources

By adopting turbovec, teams can shrink their vector indexes by 8‑16×, eliminate costly retraining steps, and run fast, reliable similarity search on the hardware they already have. This makes scalable, production‑grade RAG feasible even in resource‑constrained or air‑gapped environments.

Vladimir Dyachkov, Ph.D
Editor-in-Chief itinai.com
