In our previous post, we explored how to transform textual input into numerical representations called embeddings. Now that we have these vectors, we need a place to store them. This is where the “Store” component of our retrieval pipeline comes into play. Think of it as the library where all our embedding vectors reside, ready to be searched and compared. This post will cover the core concepts behind storing embeddings, different storage options, and the challenges associated with scaling this crucial component.
Why Do We Need a Dedicated Store?
Imagine having a massive dataset of text documents, each converted into a vector representation. Simply keeping these vectors in a Python list wouldn't be efficient for searching. We need a system that allows us to:
- Efficiently store and retrieve vectors: Fast lookups are critical for real-time search.
- Support approximate nearest neighbor (ANN) search: Finding the closest vectors is often more practical than a brute-force comparison of every vector.
- Scale to massive datasets: Real-world applications often involve millions or even billions of vectors.
This is where specialized storage solutions come into play, often referred to as vector databases, though they can also be implemented with standard databases.
Vector Databases vs. Standard Databases
While standard databases can be used to store embeddings (e.g., with PostgreSQL's pgvector extension), vector databases are specifically designed for this purpose. They offer several advantages:
- Optimized for ANN search: Built-in algorithms for fast approximate nearest neighbor search.
- Scalability: Designed to handle massive datasets and high query volumes.
- Specialized data structures: Utilize efficient data structures like HNSW (Hierarchical Navigable Small World) graphs.
- Managed services: Many cloud providers offer fully managed vector database services.
Comparison Table:
| Feature | Standard Database (with Vector Extensions) | Vector Database |
|---|---|---|
| ANN Search | Available via extensions (e.g., pgvector), with fewer options | Built-in |
| Scalability | Limited without significant engineering effort | Designed for massive datasets |
| Data Structures | General-purpose | Specialized (e.g., HNSW) |
| Ease of Use | More complex to set up and manage | Easier to use, often managed services |
| Cost | Can be cost-effective for smaller datasets | Can be more expensive for very large datasets |
Basic Storage: Python Lists (Not Recommended for Production)
Let's start with the simplest approach: storing embeddings in a Python list. This is purely for illustrative purposes and should not be used in production.
```python
# Simple example: storing embeddings in a Python list
import numpy as np

# Create some dummy embeddings
embedding_1 = np.array([0.1, 2.3, 1.5])
embedding_2 = np.array([1.2, 3.4, 2.5])
embedding_3 = np.array([0.8, 4.1, 3.2])

# Store embeddings in a list
embeddings = [embedding_1, embedding_2, embedding_3]

# Access an embedding
print(embeddings[0])  # Output: [0.1 2.3 1.5]

# This is extremely inefficient for large datasets!
```
Output:
```
[0.1 2.3 1.5]
```
This approach has severe limitations:
- Slow search: Finding the nearest neighbor requires comparing the query vector to every vector in the list.
- Memory inefficiency: Python lists can be memory-intensive, especially for large datasets.
- No indexing: No way to speed up the search process.
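To make the "slow search" limitation concrete, here is a minimal brute-force nearest-neighbor scan over the same toy vectors. This is a sketch for illustration only; a real corpus would have far more vectors and far higher dimensionality.

```python
import numpy as np

# The same three toy embeddings as above
embeddings = np.array([
    [0.1, 2.3, 1.5],
    [1.2, 3.4, 2.5],
    [0.8, 4.1, 3.2],
])

query = np.array([0.9, 1.9, 2.1])

# Brute force: compute the Euclidean distance to every stored vector.
# This is O(N * d) per query -- fine for three vectors, hopeless for millions.
distances = np.linalg.norm(embeddings - query, axis=1)
nearest = int(np.argmin(distances))

print(f"Nearest index: {nearest}, distance: {distances[nearest]:.4f}")
# Nearest index: 0, distance: 1.0770
```

Every query touches every vector, which is exactly the cost that indexing structures are designed to avoid.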
NumPy Arrays and SciPy (Slightly Better, Still Not Ideal)
Using NumPy arrays offers some improvements over Python lists, especially for numerical operations. SciPy provides efficient search algorithms, but it still lacks the specialized features of a vector database.
```python
# Using NumPy and SciPy for basic search
import numpy as np
from scipy.spatial import KDTree

# Create some dummy embeddings
embeddings = np.array([
    [0.1, 2.3, 1.5],
    [1.2, 3.4, 2.5],
    [0.8, 4.1, 3.2],
    [2.1, 1.8, 0.9]
])

# Build a KDTree for efficient search
tree = KDTree(embeddings)

# Query vector
query_vector = np.array([0.9, 1.9, 2.1])

# Find the nearest neighbor
distance, index = tree.query(query_vector)

print(f"Nearest neighbor index: {index}")
print(f"Distance: {distance}")
```
Output:
```
Nearest neighbor index: 0
Distance: 1.0770329614269007
```
While this is faster than a brute-force scan of a Python list, it still doesn't scale well: KDTree construction can be slow for very large datasets, and KD-trees lose their advantage in high dimensions, where real embedding vectors (often hundreds of dimensions) live.
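One more caveat: embeddings are usually compared with cosine similarity, while KDTree works with Euclidean distance. If you normalize every vector to unit length, the two orderings coincide, since for unit vectors ||a − b||² = 2 − 2·cos(a, b). A sketch of this trick on the same toy data:

```python
import numpy as np
from scipy.spatial import KDTree

embeddings = np.array([
    [0.1, 2.3, 1.5],
    [1.2, 3.4, 2.5],
    [0.8, 4.1, 3.2],
    [2.1, 1.8, 0.9],
])

# Normalize rows to unit length; for unit vectors,
# ||a - b||^2 = 2 - 2 * cos(a, b), so Euclidean order == cosine order.
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
tree = KDTree(normed)

query = np.array([0.9, 1.9, 2.1])
query = query / np.linalg.norm(query)

distance, index = tree.query(query)
print(f"Cosine-nearest index: {index}")  # index 1, not the Euclidean-nearest index 0
```

Note that the cosine-nearest vector differs from the Euclidean-nearest one here, which is why it matters to match the distance metric to the one your embedding model was trained for.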
HNSW (Hierarchical Navigable Small World) Graphs
HNSW is a graph-based algorithm that provides excellent performance for approximate nearest neighbor search. Many vector databases use HNSW as their underlying search algorithm.
Unfortunately, directly implementing HNSW from scratch is quite complex. Instead, we're going to use a library that provides HNSW functionality. Let's use nmslib for this example.
First, install it:
```
pip install nmslib
```
Now, let's create a simple example:
```python
# Using nmslib for HNSW search
import nmslib
import numpy as np

# Create some dummy embeddings
embeddings = np.array([
    [0.1, 2.3, 1.5],
    [1.2, 3.4, 2.5],
    [0.8, 4.1, 3.2],
    [2.1, 1.8, 0.9]
], dtype=np.float32)

# Initialize an HNSW index; 'l2' is Euclidean distance
index = nmslib.init(method='hnsw', space='l2')

# Add embeddings to the index
index.addDataPointBatch(embeddings)

# Build the index
index.createIndex({'post': 2}, print_progress=True)

# Query vector
query_vector = np.array([0.9, 1.9, 2.1], dtype=np.float32)

# Search for the k nearest neighbors
k = 2
indices, distances = index.knnQuery(query_vector, k=k)

print(f"Nearest neighbor indices: {indices}")
print(f"Distances: {distances}")
```
Output (values approximate):
```
Nearest neighbor indices: [0 1]
Distances: [1.077 1.581]
```
This demonstrates the basic workflow for using HNSW. The index is built, embeddings are added, and a query is performed.
Choosing a Vector Database
Several excellent vector databases are available, each with its own strengths and weaknesses. Here are a few popular options:
- Pinecone: Fully managed service, excellent performance, easy to use, but can be expensive.
- Weaviate: Open-source, flexible, GraphQL API, good for complex data models.
- Milvus: Open-source, designed for massive datasets, supports various indexing algorithms.
- Qdrant: Open-source, focuses on speed and scalability, good for real-time applications.
The best choice depends on your specific requirements and budget.
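Whichever database you pick, the core interface is broadly similar: add vectors under IDs, then query by similarity. As a mental model only (not a substitute for a real vector database), here is a minimal in-memory sketch of that interface using brute-force search; the class and method names are hypothetical.

```python
import numpy as np

class InMemoryVectorStore:
    """Toy vector store: brute-force search, for illustration only."""

    def __init__(self, dim):
        self.dim = dim
        self.ids = []
        self.vectors = np.empty((0, dim), dtype=np.float32)

    def add(self, vec_id, vector):
        # Store the ID and append the vector as a new row
        self.ids.append(vec_id)
        row = np.asarray(vector, dtype=np.float32).reshape(1, self.dim)
        self.vectors = np.vstack([self.vectors, row])

    def query(self, vector, k=1):
        # Brute-force Euclidean search over all stored vectors
        q = np.asarray(vector, dtype=np.float32)
        distances = np.linalg.norm(self.vectors - q, axis=1)
        order = np.argsort(distances)[:k]
        return [(self.ids[i], float(distances[i])) for i in order]

store = InMemoryVectorStore(dim=3)
store.add("doc-1", [0.1, 2.3, 1.5])
store.add("doc-2", [1.2, 3.4, 2.5])

results = store.query([0.9, 1.9, 2.1], k=1)
print(results)  # [('doc-1', ...)]
```

Real vector databases add the pieces this sketch omits: ANN indexing, metadata filtering, persistence, and replication.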
Conclusion
Storing embeddings efficiently is a critical step in building a robust retrieval pipeline. While basic approaches like Python lists and NumPy arrays are possible, they donโt scale well. HNSW graphs and dedicated vector databases provide the performance and scalability needed for real-world applications. By understanding the principles behind these technologies, you can build a retrieval pipeline that delivers accurate and fast results. Remember to consider the trade-offs between cost, performance, and ease of use when choosing a storage solution.