Product Quantization: Compressing Vector Spaces (Day 6)

Welcome back! In our previous post, we explored Inverted File Indexes (IVF), a method for partitioning data into clusters to accelerate similarity search. Today, we're building on that foundation by introducing Product Quantization (PQ). While IVF tells us where to search, PQ tells us how to represent the data within those search spaces, drastically reducing the memory footprint and improving search speed.

The Problem with IVF Alone:

Imagine you have a massive IVF index. Each cluster (partition) still contains many vectors. Searching within these clusters can still be computationally expensive. Furthermore, storing all these high-dimensional vectors directly consumes a lot of memory. This is where Product Quantization comes to the rescue.

What is Product Quantization?

Product Quantization is a lossy compression technique. It's based on the idea of dividing each high-dimensional vector into smaller, lower-dimensional subspaces. Then, we quantize (approximate) each subspace using a codebook of representative vectors (centroids). This effectively converts a continuous vector into a sequence of discrete codes, significantly reducing its size.

Analogy:

Think about compressing a photograph. Instead of storing the color value of every pixel, JPEG compression groups pixels into blocks and approximates the colors within each block using a limited set of colors. Product Quantization does the same thing for vectors – breaking them down and approximating them with a limited set of "codebook" entries.

The Process – A Step-by-Step Breakdown:

  1. Subspace Division: A high-dimensional vector (e.g., 128 dimensions) is split into m lower-dimensional subspaces (e.g., 8 subspaces of 16 dimensions each).

  2. Codebook Generation: For each subspace, we generate a codebook. This codebook contains k representative vectors (centroids) derived from the data within that subspace using a clustering algorithm like k-means.

  3. Quantization: Each vector is then "quantized" by finding the closest centroid in each subspace and replacing the original subspace vector with the index of that centroid in the codebook.

  4. Reconstruction (Optional): To estimate the original vector, we can reconstruct it using the centroids corresponding to the quantized indices. This reconstruction introduces a small amount of error, but itโ€™s often a worthwhile trade-off for the significant reduction in storage and computation.

Visual Representation:

High-Dimensional Vector (128D) --> Subspace Division (8 x 16D) 
--> Codebook Generation (8 codebooks, k centroids each) --> 
Quantization (Indices) --> Compressed Representation

Code Example 1: Simple Codebook Generation (Python)

This example demonstrates how to create a simple codebook using k-means clustering.

import numpy as np
from sklearn.cluster import KMeans

def train_product_quantizer(data, m=8, k=256):
    """
    Train a Product Quantizer with m subspaces and k centroids per subspace.
    
    Args:
        data: Training data, shape (n_samples, d_dimensions)
        m: Number of subspaces
        k: Number of centroids per codebook
    
    Returns:
        List of m codebooks, each with k centroids
    """
    n_samples, d = data.shape
    subspace_dim = d // m
    
    codebooks = []
    
    # Train one codebook per subspace
    for i in range(m):
        # Extract data for this subspace
        subspace_start = i * subspace_dim
        subspace_end = (i + 1) * subspace_dim
        subspace_data = data[:, subspace_start:subspace_end]
        
        # Train k-means on this subspace
        kmeans = KMeans(n_clusters=k, random_state=42, n_init='auto')
        kmeans.fit(subspace_data)
        
        # Store the codebook (centroids) for this subspace
        codebooks.append(kmeans.cluster_centers_)
        
        print(f"Codebook {i+1}/{m}: shape {kmeans.cluster_centers_.shape}")
    
    return codebooks

# Example: Train PQ on 1000 vectors of 128 dimensions
# Split into 8 subspaces of 16 dimensions, 256 centroids per subspace
data = np.random.rand(1000, 128)
codebooks = train_product_quantizer(data, m=8, k=256)

print(f"\nTotal codebooks: {len(codebooks)}")
print(f"Each codebook shape: {codebooks[0].shape}")  # (256, 16)

This function splits the training data into m subspaces and runs k-means independently on each one. The centroids learned for each subspace become that subspace's codebook; together, the m codebooks define how every vector will be quantized.
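Quantizing a vector against these codebooks means finding, for each subspace, the index of the nearest centroid. The examples below call a `quantize_vector` helper; a minimal sketch of it (assuming the codebook layout produced by the training function above) could look like this:

```python
import numpy as np

def quantize_vector(vector, codebooks):
    """Map a vector to m centroid indices, one per subspace."""
    m = len(codebooks)
    subspace_dim = len(vector) // m
    indices = []
    for i, codebook in enumerate(codebooks):
        # Slice out this subspace and find its nearest centroid
        subvector = vector[i*subspace_dim:(i+1)*subspace_dim]
        distances = np.linalg.norm(codebook - subvector, axis=1)
        indices.append(int(np.argmin(distances)))
    return indices
```

When k is at most 256, each index fits in a single byte, which is what makes the storage savings discussed later possible.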

Code Example 2: Reconstructing a Vector (Python)

While the quantized representation is compact, we can reconstruct an approximation of the original vector.

def reconstruct_vector(quantized_indices, codebooks):
    """
    Reconstruct an approximate vector from its quantized indices.
    
    Args:
        quantized_indices: List of m indices (one per subspace)
        codebooks: List of m codebooks
    
    Returns:
        Reconstructed vector (approximation of original)
    """
    reconstructed_subspaces = []
    
    for i, index in enumerate(quantized_indices):
        # Get the centroid from this subspace's codebook
        centroid = codebooks[i][index]
        reconstructed_subspaces.append(centroid)
    
    # Concatenate all subspaces
    reconstructed_vector = np.concatenate(reconstructed_subspaces)
    
    return reconstructed_vector

# Example usage:
original_vector = np.random.rand(128)
print(f"Original vector shape: {original_vector.shape}")

# Quantize
quantized_indices = quantize_vector(original_vector, codebooks)
print(f"Quantized to {len(quantized_indices)} indices")

# Reconstruct
reconstructed = reconstruct_vector(quantized_indices, codebooks)
print(f"Reconstructed vector shape: {reconstructed.shape}")

# Calculate reconstruction error
error = np.linalg.norm(original_vector - reconstructed)
print(f"Reconstruction error (L2 distance): {error:.4f}")

This function looks up the corresponding centroid in each subspace's codebook and concatenates them to reconstruct the approximate vector.

Asymmetric Distance Computation (ADC): Fast Search with PQ

The key insight of PQ is that we can compute distances without fully reconstructing the vectors:

During Search:

  1. Query vector is NOT quantized (stays in original space)
  2. Pre-compute distances from query subspaces to ALL codebook centroids
  3. For each candidate vector, look up distances using its quantized indices
  4. Sum the pre-computed distances = approximate distance to query

Code Example:

def compute_distance_table(query_vector, codebooks):
    """
    Pre-compute squared distances from the query to all codebook centroids.
    
    Returns: List of distance tables, one per subspace
    """
    m = len(codebooks)
    d = len(query_vector)
    subspace_dim = d // m
    
    distance_tables = []
    
    for i, codebook in enumerate(codebooks):
        # Extract query subspace
        query_subspace = query_vector[i*subspace_dim:(i+1)*subspace_dim]
        
        # Squared distances to all centroids in this codebook.
        # Squared distances sum correctly across subspaces; plain norms do not.
        distances = np.sum((codebook - query_subspace) ** 2, axis=1)
        distance_tables.append(distances)
    
    return distance_tables

def compute_approximate_distance(quantized_indices, distance_tables):
    """
    Compute the approximate L2 distance using the pre-computed tables.
    """
    total_squared = 0.0
    for i, index in enumerate(quantized_indices):
        total_squared += distance_tables[i][index]
    
    return np.sqrt(total_squared)

# Example: Fast search
query = np.random.rand(128)
distance_tables = compute_distance_table(query, codebooks)

# For each candidate vector (represented by quantized indices)
candidate_indices = quantize_vector(np.random.rand(128), codebooks)
approx_distance = compute_approximate_distance(candidate_indices, distance_tables)

print(f"Approximate distance: {approx_distance:.4f}")
# This is MUCH faster than reconstructing and computing exact distance!

Why ADC is Fast:

  • Pre-computation: O(m × k) – done once per query
  • Per-candidate: O(m) lookups – just summing m numbers
  • vs. exact: O(d) – full 128-dimensional distance computation
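To put numbers on these bounds for the running example (m = 8, k = 256, d = 128), a quick back-of-the-envelope count:

```python
# Operation counts for ADC with the running example's parameters
m, k, d = 8, 256, 128

table_entries = m * k        # distances pre-computed once per query
per_candidate_adc = m        # table lookups + additions per candidate
per_candidate_exact = d      # per-dimension operations for an exact distance

print(table_entries, per_candidate_adc, per_candidate_exact)  # 2048 8 128
```

After the one-time 2,048-entry table build, every candidate costs just 8 additions instead of a 128-dimensional distance computation, so the pre-computation pays for itself after a handful of candidates.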

Combining IVF and PQ:

The true power of Product Quantization lies in its synergy with IVF. After performing IVF to partition the data, we can apply PQ to the vectors within each partition. This significantly reduces the memory footprint of the index and speeds up the search process. Instead of storing full vectors, we store the quantized indices for each vector within a partition.

Practical Considerations & Trade-offs:

  • Number of Subspaces (m): With k fixed, more subspaces yield a finer approximation (lower error), but each vector's code grows to m indices, so compression weakens and quantization costs more.
  • Number of Codebook Entries (k): A larger codebook provides better approximation but requires more memory.
  • Reconstruction Error: Quantization introduces a small amount of error. The acceptable level of error depends on the specific application.
  • Computational Overhead: While PQ reduces memory usage, it does introduce some computational overhead for quantization and reconstruction.
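The memory side of the m/k trade-off follows a simple formula: each code is m indices of ceil(log2 k) bits. A small sketch (rounding each index up to whole bytes; tightly bit-packed layouts can do slightly better):

```python
import math

def pq_code_bytes(m, k):
    """Bytes per PQ-encoded vector, rounding each index up to whole bytes."""
    bits_per_index = math.ceil(math.log2(k))
    return m * math.ceil(bits_per_index / 8)

print(pq_code_bytes(8, 256))    # 8 bytes (the running example)
print(pq_code_bytes(16, 256))   # 16 bytes: finer approximation, less compression
print(pq_code_bytes(8, 1024))   # 16 bytes: 10-bit indices round up to 2 bytes each
```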

Visualizing the Benefits:

Imagine an IVF index with 1000 partitions. Each partition contains 1000 vectors of 128 dimensions.

Without PQ:

  • Storage per vector: 128 dims × 4 bytes (float32) = 512 bytes
  • Total per partition: 1000 vectors × 512 bytes = 512 KB
  • Total index size: 1000 partitions × 512 KB = 512 MB

With PQ (m=8 subspaces, k=256 centroids):

  • Storage per vector: 8 indices × 1 byte = 8 bytes (256 values fit in exactly 8 bits)
  • Total per partition: 1000 vectors × 8 bytes = 8 KB
  • Codebook storage: 8 codebooks × 256 centroids × 16 dims × 4 bytes = 131 KB
  • Total index size: (1000 partitions × 8 KB) + 131 KB ≈ 8 MB

Compression ratio: 512 MB / 8 MB = 64x smaller!

The trade-off is reconstruction error (typically 1-5% depending on parameters).
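The arithmetic above is easy to verify directly; the exact ratio lands just under 64x once the shared codebooks are counted:

```python
# Index size estimate using the worked example's numbers
n_partitions, vectors_per_partition, d = 1000, 1000, 128
m, k, subspace_dim = 8, 256, 16

raw_bytes = n_partitions * vectors_per_partition * d * 4    # float32 vectors
code_bytes = n_partitions * vectors_per_partition * m * 1   # 1-byte indices
codebook_bytes = m * k * subspace_dim * 4                   # shared codebooks

print(raw_bytes)                                  # 512000000 (~512 MB)
print(code_bytes + codebook_bytes)                # 8131072   (~8 MB)
print(raw_bytes / (code_bytes + codebook_bytes))  # ~63
```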

Conclusion:

Product Quantization is a powerful technique for compressing vector spaces and accelerating similarity search. When combined with IVF, it provides a highly efficient solution for indexing large datasets. By understanding the trade-offs and practical considerations, you can leverage PQ to build highly scalable and performant search systems. Remember to experiment with different parameters to find the optimal configuration for your specific application.


Discover more from A Streak of Communication

Subscribe to get the latest posts sent to your email.
