Product Quantization
Also known as: PQ, PQ indexing, PQ compression
- Product Quantization: A vector compression method that divides high-dimensional vectors into smaller subvectors, quantizes each independently using learned codebooks, and stores compact codes that enable fast approximate nearest neighbor search with reduced memory.
Product quantization is a vector compression technique that splits high-dimensional vectors into smaller subvectors, quantizes each independently, and enables approximate nearest neighbor search using a fraction of the original memory.
What It Is
When you search for similar items in a vector database — whether finding relevant documents for a chatbot or matching product images — the system compares your query against potentially billions of stored embedding vectors. Each vector might contain hundreds of floating-point numbers. Storing and searching all of them in full precision quickly becomes impractical. Product quantization solves this by compressing vectors into compact codes that are small enough to keep in memory while still preserving enough information for similarity search.
Think of it like describing a house using a fixed checklist instead of a full paragraph. Rather than writing “three-bedroom colonial with a wraparound porch and slate roof,” you pick from pre-built categories: style: colonial, bedrooms: 3, exterior: porch, roof: slate. Each category has a limited set of options. You lose some nuance, but you can compare houses much faster because you’re matching category codes instead of reading full descriptions.
According to Jegou et al., the technique works by splitting each high-dimensional vector into M smaller subvectors (called subspaces). A separate k-means clustering (a method that groups similar points around representative centers) runs on each subspace during a training phase, producing a codebook of k centroids per subspace. To encode a vector, PQ replaces each subvector with the index of its nearest centroid. With k set to 256 (a common default), each subvector index fits in a single byte, so an entire vector compresses down to just M bytes regardless of its original dimensionality.
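The train-and-encode steps above can be sketched in plain numpy. This is a minimal illustration rather than a production implementation: the tiny hand-rolled k-means, the random training data, and the small values of M and k (4 subspaces, 16 centroids instead of the usual 256) are all assumptions chosen to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(data, k, iters=20):
    # Simple Lloyd's k-means: returns k centroids for one subspace.
    centroids = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        # Assign every point to its nearest centroid, then recompute means.
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            members = data[labels == j]
            if len(members):
                centroids[j] = members.mean(0)
    return centroids

D, M, k = 32, 4, 16                  # dimension, subspaces, centroids per subspace
sub = D // M                         # each subvector spans D/M dimensions
train = rng.normal(size=(1000, D)).astype(np.float32)

# Training phase: learn one codebook per subspace.
codebooks = [kmeans(train[:, m * sub:(m + 1) * sub], k) for m in range(M)]

def encode(x):
    # Replace each subvector with the index of its nearest centroid.
    return np.array([
        ((codebooks[m] - x[m * sub:(m + 1) * sub]) ** 2).sum(1).argmin()
        for m in range(M)
    ], dtype=np.uint8)

code = encode(train[0])              # M bytes instead of D * 4 bytes
```

With k = 256 instead of 16, each index would use the full range of one byte, matching the M-bytes-per-vector figure described above.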
During search, the system doesn’t need to decompress anything. According to Jegou et al., Asymmetric Distance Computation (ADC) lets the system compare the exact query vector against quantized database vectors directly. It pre-computes a small lookup table of distances between each query subvector and all centroids, then sums the relevant table entries for each database vector. These table lookups replace expensive per-dimension floating-point distance calculations with just M additions per database vector.
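A minimal numpy sketch of ADC, using randomly generated codebooks and codes as stand-ins for trained ones (the shapes and the 256-centroid assumption mirror the common one-byte-per-subspace setup):

```python
import numpy as np

rng = np.random.default_rng(1)
M, k, sub = 4, 256, 8                       # subspaces, centroids, subvector dim
codebooks = rng.normal(size=(M, k, sub)).astype(np.float32)

# Database of 1000 vectors already encoded as M one-byte centroid indices.
codes = rng.integers(0, k, size=(1000, M), dtype=np.uint8)
query = rng.normal(size=M * sub).astype(np.float32)

# Step 1: precompute one small table per subspace, holding the squared
# distance from the query subvector to every centroid (M * k entries total).
q_sub = query.reshape(M, sub)
tables = ((codebooks - q_sub[:, None, :]) ** 2).sum(-1)    # shape (M, k)

# Step 2: the approximate distance to each database vector is just a sum of
# M table lookups; no decompression, no per-dimension arithmetic.
approx_dists = tables[np.arange(M), codes].sum(axis=1)     # shape (1000,)
nearest = approx_dists.argmin()
```

The per-vector cost is M lookups and additions regardless of the original dimensionality, which is where ADC's speed comes from.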
How It’s Used in Practice
If you work with any application that retrieves information from large collections — semantic search, retrieval-augmented generation (RAG), recommendation engines, or image similarity — there’s a good chance product quantization is running behind the scenes. According to FAISS Docs, Meta’s FAISS library implements PQ through its IndexPQ and IndexIVFPQ index types; the latter combines inverted file indexing with product quantization to search billions of vectors on commodity hardware. Most major vector databases, including Milvus, Weaviate, and Qdrant, offer PQ-based indexing as a built-in option.
The typical workflow looks like this: generate embeddings from your data, train PQ codebooks on a representative sample, encode your full dataset, and run queries. The tradeoff is explicit — you accept some loss in recall (the percentage of true nearest neighbors returned) in exchange for much lower memory usage and faster search.
Pro Tip: Reach for PQ only after flat (exact) search becomes too slow or memory-hungry. PQ adds a training step and introduces approximation error. For datasets under a few hundred thousand vectors, a flat (exact) index or a graph-based index like HNSW is usually fast enough and avoids tuning the number of subspaces (M) and centroids (k).
When to Use / When Not
| Scenario | Use | Avoid |
|---|---|---|
| Dataset has millions or billions of vectors that won’t fit in RAM | ✅ | |
| You need exact nearest neighbor results with zero error | | ❌ |
| Building a RAG pipeline over a large document corpus | ✅ | |
| Vectors have very low dimensionality (under 32 dimensions) | | ❌ |
| Latency and memory budget matter more than perfect recall | ✅ | |
| Prototyping with a small dataset where quick iteration matters | | ❌ |
Common Misconception
Myth: Product quantization preserves exact distances between vectors, so search results are the same as brute-force search. Reality: PQ is lossy compression. It approximates distances rather than preserving them. At higher compression ratios (fewer subspaces), recall drops noticeably. Always benchmark recall against your accuracy requirements before deploying PQ in production.
One Sentence to Remember
Product quantization trades precision for scale — it compresses vectors into compact codes so you can search billions of them in the memory that would otherwise hold only millions, making it one of the core building blocks that turn distance metrics into practical index structures.
FAQ
Q: How much memory does product quantization actually save? A: It depends on the number of subspaces (M). With M=8 and 256 centroids per subspace, a 128-dimensional float32 vector shrinks from 512 bytes to 8 bytes — a 64x reduction.
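The arithmetic behind that answer, as a quick sanity check:

```python
dim, bytes_per_float = 128, 4
M = 8                               # subspaces, one byte each with 256 centroids
original = dim * bytes_per_float    # 512 bytes for a raw float32 vector
compressed = M                      # 8 bytes of PQ codes
ratio = original // compressed      # 64x reduction
```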
Q: Does product quantization work with any distance metric? A: PQ works best with Euclidean (L2) distance. It can approximate inner product with modifications, but cosine similarity typically requires normalizing vectors before quantization.
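For the cosine case, the normalization step is a one-liner. This sketch assumes float32 embeddings with no zero vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
vecs = rng.normal(size=(100, 64)).astype(np.float32)

# L2-normalize each vector; cosine similarity on the originals then matches
# inner product (and L2-distance ranking) on the unit vectors, so either
# metric can be used after quantization.
unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
```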
Q: What is the difference between product quantization and scalar quantization? A: Scalar quantization reduces precision per dimension (like float32 to int8). Product quantization groups dimensions into subspaces and replaces each group with a centroid index, achieving higher compression at the cost of a training step.
Sources
- Jegou et al.: Product Quantization for Nearest Neighbor Search (IEEE TPAMI 2011) - The foundational paper introducing PQ and asymmetric distance computation
- FAISS Docs: Welcome to Faiss Documentation - Reference implementation with IndexPQ and IndexIVFPQ index types
Expert Takes
Product quantization applies a subspace decomposition that tames the storage cost of high dimensionality while preserving enough geometric structure for approximate search. The key theoretical insight is that distances computed from concatenated subspace approximations converge toward the true distances as the number of subspaces grows, though PQ offers no tight formal error guarantees, a gap that newer quantization methods are beginning to address.
When you configure a vector index in FAISS or any database that supports PQ, the choice of M (subspaces) and k (centroids) directly determines your recall-memory tradeoff. The practical pattern is: train on a representative sample, benchmark recall at your target query latency, then adjust M upward if recall is too low. Skipping the benchmark step is how teams end up with search results that quietly miss relevant documents.
Product quantization is what makes billion-scale similarity search economically viable. Without compression, serving large embedding collections in their full dimensionality requires enormous amounts of RAM — a cost most organizations can’t justify. PQ flips that equation by shrinking memory requirements by orders of magnitude, turning vector search from a capability reserved for the largest tech companies into something any team can deploy.
The lossy nature of product quantization raises a question worth sitting with: when approximate search misses a relevant result, who notices? In recommendation systems, a missed match is a minor inconvenience. In medical retrieval or legal discovery, it could mean overlooking a critical document. The compression ratio you choose is an implicit statement about how much accuracy you’re willing to sacrifice — and for whom.