In our previous post, we explored how to transform textual input into numerical representations called embeddings. Now that we have these vectors, we need a place to store them. This is where the “Store” component of our retrieval pipeline comes into play. Think of it as the library where all our embedding vectors reside, ready to be searched and compared. This post will cover the core concepts behind storing embeddings, different storage options, and the challenges associated with scaling this crucial component.
Why Do We Need a Dedicated Store?
Imagine having a massive dataset of text documents, each converted into a vector representation. Simply keeping these vectors in a Python list wouldn't be efficient for searching. We need a system that allows us to:
- Efficiently store and retrieve vectors: Fast lookups are critical for real-time search.
- Support approximate nearest neighbor (ANN) search: Finding the closest vectors is often more practical than a brute-force comparison of every vector.
- Scale to massive datasets: Real-world applications often involve millions or even billions of vectors.
This is where specialized storage solutions come into play, often referred to as vector databases, though they can also be implemented with standard databases.
Vector Databases vs. Standard Databases
While standard databases can be used to store embeddings (e.g., with PostgreSQL's pgvector extension), vector databases are specifically designed for this purpose. They offer several advantages:
- Optimized for ANN search: Built-in algorithms for fast approximate nearest neighbor search.
- Scalability: Designed to handle massive datasets and high query volumes.
- Specialized data structures: Utilize efficient data structures like HNSW (Hierarchical Navigable Small World) graphs.
- Managed services: Many cloud providers offer fully managed vector database services.
Comparison Table:
| Feature | Standard Database (with Vector Extensions) | Vector Database |
|---|---|---|
| ANN Search | Available via extensions (e.g., pgvector), with fewer options | Built-in |
| Scalability | Limited without significant engineering effort | Designed for massive datasets |
| Data Structures | General-purpose | Specialized (e.g., HNSW) |
| Ease of Use | More complex to set up and manage | Easier to use, often managed services |
| Cost | Can be cost-effective for smaller datasets | Can be more expensive for very large datasets |
Basic Storage: Python Lists (Not Recommended for Production)
Let's start with the simplest approach: storing embeddings in a Python list. This is purely for illustrative purposes and should not be used in production.
```python
# Simple example: storing embeddings in a Python list
import numpy as np

# Create some dummy embeddings
embedding_1 = np.array([0.1, 2.3, 1.5])
embedding_2 = np.array([1.2, 3.4, 2.5])
embedding_3 = np.array([0.8, 4.1, 3.2])

# Store embeddings in a list
embeddings = [embedding_1, embedding_2, embedding_3]

# Access an embedding
print(embeddings[0])  # Output: [0.1 2.3 1.5]

# This is extremely inefficient for large datasets!
```
Output:
```
[0.1 2.3 1.5]
```
This approach has severe limitations:
- Slow search: Finding the nearest neighbor requires comparing the query vector to every vector in the list.
- Memory inefficiency: Python lists can be memory-intensive, especially for large datasets.
- No indexing: No way to speed up the search process.
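To make the "slow search" limitation concrete, here is a minimal brute-force nearest-neighbor scan over the same toy vectors. This is a sketch for illustration only; a real corpus would have far more vectors and far higher dimensionality.

```python
import numpy as np

# The same three toy embeddings as above
embeddings = np.array([
    [0.1, 2.3, 1.5],
    [1.2, 3.4, 2.5],
    [0.8, 4.1, 3.2],
])

query = np.array([0.9, 1.9, 2.1])

# Brute force: compute the Euclidean distance to every stored vector.
# This is O(N * d) per query -- fine for three vectors, hopeless for millions.
distances = np.linalg.norm(embeddings - query, axis=1)
nearest = int(np.argmin(distances))

print(f"Nearest index: {nearest}, distance: {distances[nearest]:.4f}")
# Nearest index: 0, distance: 1.0770
```

Every query touches every vector, which is exactly the cost that indexing structures are designed to avoid.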
NumPy Arrays and SciPy (Slightly Better, Still Not Ideal)
Using NumPy arrays offers some improvements over Python lists, especially for numerical operations. SciPy provides efficient search algorithms, but it still lacks the specialized features of a vector database.
```python
# Using NumPy and SciPy for basic search
import numpy as np
from scipy.spatial import KDTree

# Create some dummy embeddings
embeddings = np.array([
    [0.1, 2.3, 1.5],
    [1.2, 3.4, 2.5],
    [0.8, 4.1, 3.2],
    [2.1, 1.8, 0.9]
])

# Build a KDTree for efficient search
tree = KDTree(embeddings)

# Query vector
query_vector = np.array([0.9, 1.9, 2.1])

# Find the nearest neighbor
distance, index = tree.query(query_vector)

print(f"Nearest neighbor index: {index}")
print(f"Distance: {distance}")
```
Output:
```
Nearest neighbor index: 0
Distance: 1.0770329614269007
```
While this is faster than a brute-force scan of a Python list, it still doesn't scale well: KDTree construction can be slow for very large datasets, and KD-trees lose their advantage in high dimensions, where real embedding vectors (often hundreds of dimensions) live.
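One more caveat: embeddings are usually compared with cosine similarity, while KDTree works with Euclidean distance. If you normalize every vector to unit length, the two orderings coincide, since for unit vectors ||a − b||² = 2 − 2·cos(a, b). A sketch of this trick on the same toy data:

```python
import numpy as np
from scipy.spatial import KDTree

embeddings = np.array([
    [0.1, 2.3, 1.5],
    [1.2, 3.4, 2.5],
    [0.8, 4.1, 3.2],
    [2.1, 1.8, 0.9],
])

# Normalize rows to unit length; for unit vectors,
# ||a - b||^2 = 2 - 2 * cos(a, b), so Euclidean order == cosine order.
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
tree = KDTree(normed)

query = np.array([0.9, 1.9, 2.1])
query = query / np.linalg.norm(query)

distance, index = tree.query(query)
print(f"Cosine-nearest index: {index}")  # index 1, not the Euclidean-nearest index 0
```

Note that the cosine-nearest vector differs from the Euclidean-nearest one here, which is why it matters to match the distance metric to the one your embedding model was trained for.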
HNSW (Hierarchical Navigable Small World) Graphs
HNSW is a graph-based algorithm that provides excellent performance for approximate nearest neighbor search. Many vector databases use HNSW as their underlying search algorithm.
Unfortunately, directly implementing HNSW from scratch is quite complex. Instead, we're going to use a library that provides HNSW functionality. Let's use nmslib for this example.
First, install it:
```
pip install nmslib
```
Now, let's create a simple example:
```python
# Using nmslib for HNSW search
import nmslib
import numpy as np

# Create some dummy embeddings
embeddings = np.array([
    [0.1, 2.3, 1.5],
    [1.2, 3.4, 2.5],
    [0.8, 4.1, 3.2],
    [2.1, 1.8, 0.9]
], dtype=np.float32)

# Initialize an HNSW index; 'l2' is Euclidean distance
index = nmslib.init(method='hnsw', space='l2')

# Add embeddings to the index
index.addDataPointBatch(embeddings)

# Build the index
index.createIndex({'post': 2}, print_progress=True)

# Query vector
query_vector = np.array([0.9, 1.9, 2.1], dtype=np.float32)

# Search for the k nearest neighbors
k = 2
indices, distances = index.knnQuery(query_vector, k=k)

print(f"Nearest neighbor indices: {indices}")
print(f"Distances: {distances}")
```
Output (values approximate):
```
Nearest neighbor indices: [0 1]
Distances: [1.077 1.581]
```
This demonstrates the basic workflow for using HNSW. The index is built, embeddings are added, and a query is performed.
Choosing a Vector Database
Several excellent vector databases are available, each with its own strengths and weaknesses. Here are a few popular options:
- Pinecone: Fully managed service, excellent performance, easy to use, but can be expensive.
- Weaviate: Open-source, flexible, GraphQL API, good for complex data models.
- Milvus: Open-source, designed for massive datasets, supports various indexing algorithms.
- Qdrant: Open-source, focuses on speed and scalability, good for real-time applications.
The best choice depends on your specific requirements and budget.
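Whichever database you pick, the core interface is broadly similar: add vectors under IDs, then query by similarity. As a mental model only (not a substitute for a real vector database), here is a minimal in-memory sketch of that interface using brute-force search; the class and method names are hypothetical.

```python
import numpy as np

class InMemoryVectorStore:
    """Toy vector store: brute-force search, for illustration only."""

    def __init__(self, dim):
        self.dim = dim
        self.ids = []
        self.vectors = np.empty((0, dim), dtype=np.float32)

    def add(self, vec_id, vector):
        # Store the ID and append the vector as a new row
        self.ids.append(vec_id)
        row = np.asarray(vector, dtype=np.float32).reshape(1, self.dim)
        self.vectors = np.vstack([self.vectors, row])

    def query(self, vector, k=1):
        # Brute-force Euclidean search over all stored vectors
        q = np.asarray(vector, dtype=np.float32)
        distances = np.linalg.norm(self.vectors - q, axis=1)
        order = np.argsort(distances)[:k]
        return [(self.ids[i], float(distances[i])) for i in order]

store = InMemoryVectorStore(dim=3)
store.add("doc-1", [0.1, 2.3, 1.5])
store.add("doc-2", [1.2, 3.4, 2.5])

results = store.query([0.9, 1.9, 2.1], k=1)
print(results)  # [('doc-1', ...)]
```

Real vector databases add the pieces this sketch omits: ANN indexing, metadata filtering, persistence, and replication.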
Conclusion
Storing embeddings efficiently is a critical step in building a robust retrieval pipeline. While basic approaches like Python lists and NumPy arrays are possible, they donโt scale well. HNSW graphs and dedicated vector databases provide the performance and scalability needed for real-world applications. By understanding the principles behind these technologies, you can build a retrieval pipeline that delivers accurate and fast results. Remember to consider the trade-offs between cost, performance, and ease of use when choosing a storage solution.