In our previous post, we explored how to store embeddings. Now that we have a store, the next logical step is to actually search it. A naive approach is to compare every single embedding in the store to our query embedding; this is called a "brute force" search. As the number of embeddings grows, however, this becomes incredibly slow. Today, we're diving into indexing: the key to making search fast.
Why is Brute Force Search So Slow?
Imagine you have a physical library with millions of books. To find a specific book by brute force, you would check every single shelf, one by one. This is incredibly inefficient. Indexing is like creating a card catalog or a table of contents: it lets you quickly locate the information you need without checking everything.
What is an Index?
An index is a data structure that allows for fast lookup of data. In the context of our retrieval pipeline, an index is a way to organize our embeddings so we can quickly find the ones that are most similar to our query embedding. Think of it as a shortcut to the most relevant information.
How Does Indexing Work?
Indexing doesn't make searching faster by magic. It works by organizing the data in a way that allows for efficient filtering and comparison, so most candidates never need to be examined at all. There are many different indexing techniques, each with its own trade-offs in terms of speed, accuracy, and memory usage.
Common Indexing Techniques (Brief Overview – No Deep Dive)
- Tree-based Indexes (e.g., KD-trees, Ball-trees): These organize embeddings into a hierarchical tree structure. While conceptually simple, they often struggle with high-dimensional data (like embeddings).
- Hash-based Indexes (e.g., Locality Sensitive Hashing – LSH): These hash similar embeddings into the same "bucket." This is fast but can lead to collisions (dissimilar embeddings landing in the same bucket).
- Graph-based Indexes (e.g., Hierarchical Navigable Small World – HNSW): These create a graph where nodes represent embeddings and edges connect similar embeddings. This is a popular choice for its balance of speed and accuracy. (We're focusing on this one today.)
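To make the LSH idea above concrete, here is a minimal random-hyperplane sketch. This is illustrative only: the vectors, the random seed, and the 8-bit signature size are arbitrary choices, not a production recipe.

```python
import numpy as np

def lsh_signature(vec, hyperplanes):
    # Each random hyperplane contributes one bit: which side of it
    # does the vector fall on?
    return tuple(int(np.dot(h, vec) > 0) for h in hyperplanes)

rng = np.random.default_rng(42)
hyperplanes = rng.normal(size=(8, 3))  # 8 bits -> up to 256 buckets

a = np.array([0.1, 0.2, 0.3])
b = np.array([0.11, 0.21, 0.29])  # nearly parallel to a
c = np.array([-0.9, 0.1, -0.5])   # points in a very different direction

# Nearby vectors tend to share a signature (bucket); distant ones tend not to.
print(lsh_signature(a, hyperplanes))
print(lsh_signature(b, hyperplanes))
print(lsh_signature(c, hyperplanes))
```

At search time you only compare the query against embeddings in its own bucket, which is where the speedup (and the risk of missing a true neighbor) comes from.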
HNSW: A Closer Look
HNSW builds a multi-layered graph. Each layer is a simplified version of the layer below it. When searching, you start at the top layer and navigate down, quickly narrowing down the search space.
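That layered descent can be sketched with a toy example. This is not a real HNSW implementation (no insertion heuristics, no beam search); it only shows the greedy top-down navigation on hand-built layers, with made-up points and adjacency lists.

```python
import numpy as np

# Toy multi-layer graph: each layer maps node id -> list of neighbour ids.
# Upper layers contain fewer nodes (a coarse "zoom level" of the data).
points = {0: [0.1, 0.1], 1: [0.2, 0.9], 2: [0.9, 0.2], 3: [0.85, 0.9], 4: [0.5, 0.5]}
layers = [
    {0: [3], 3: [0]},                                   # top layer: sparse
    {0: [3, 4], 3: [0, 4], 4: [0, 3]},                  # middle layer
    {0: [1, 2, 4], 1: [0, 3, 4], 2: [0, 3, 4],
     3: [1, 2, 4], 4: [0, 1, 2, 3]},                    # bottom layer: all nodes
]

def dist(a, b):
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

def greedy_search(query, entry=0):
    current = entry
    for graph in layers:  # descend from the top layer to the bottom
        improved = True
        while improved:
            improved = False
            for neighbour in graph[current]:
                if dist(points[neighbour], query) < dist(points[current], query):
                    current = neighbour
                    improved = True
        # the best node found on this layer becomes the entry point below
    return current

print(greedy_search([0.8, 0.85]))  # node 3 is the closest point
```

Because each upper layer is sparse, the first few hops cover large distances cheaply; the dense bottom layer then refines the answer locally.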
Simple Example: Building a Basic Index (Conceptual)
Let's imagine we have just five embeddings. We're going to create a very simplified index (not a real HNSW implementation, just to illustrate the concept).
```python
# Conceptual example - NOT a working HNSW implementation
embeddings = [
    [0.1, 0.2, 0.3],
    [0.2, 0.3, 0.4],
    [0.7, 0.8, 0.9],
    [0.8, 0.9, 1.0],
    [0.4, 0.5, 0.6],
]

# Simplified "index" - just a list sorted by the first dimension
indexed_embeddings = sorted(embeddings, key=lambda x: x[0])

query_embedding = [0.3, 0.4, 0.5]

def closest_to(query, candidates):
    # In reality, a more sophisticated distance calculation and search
    # strategy would be used; here we simply scan every candidate using
    # squared Euclidean distance over all dimensions.
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(candidates, key=lambda c: sq_dist(query, c))

closest_embedding = closest_to(query_embedding, indexed_embeddings)
print(f"Closest embedding: {closest_embedding}")
```
This is a very simplified illustration. In a real-world scenario, the indexing process would involve more sophisticated algorithms and data structures.
A More Realistic Example: Using a Library (Annoy)
Annoy (Approximate Nearest Neighbors Oh Yeah) is a popular library for building approximate nearest neighbor search indexes. It's relatively easy to use and provides good performance.
First, install Annoy:
```shell
pip install annoy
```
Now, let's build an index and add some embeddings:
```python
from annoy import AnnoyIndex
import numpy as np

# Define the dimensionality of our embeddings
dimensions = 3

# Create an Annoy index ('angular' distance approximates cosine similarity)
annoy_index = AnnoyIndex(dimensions, 'angular')

# Add some embeddings to the index
for i in range(5):
    embedding = np.random.rand(dimensions)  # generate a random embedding
    annoy_index.add_item(i, embedding)

# Build the index (more trees = better accuracy, slower build time)
annoy_index.build(5)

# Search the index for the 2 nearest neighbors of a query embedding
query_embedding = np.random.rand(dimensions)
nearest_neighbors = annoy_index.get_nns_by_vector(query_embedding, 2)

print(f"Query embedding: {query_embedding}")
print(f"Nearest neighbors: {nearest_neighbors}")
```
This code generates random embeddings, adds them to an Annoy index, builds the index, and then searches for the nearest neighbors to a query embedding. The get_nns_by_vector function returns a list of item IDs, which can then be used to retrieve the actual embeddings from the store.
Comparing Brute Force vs. Indexed Search
Let's create a simple function to measure the performance of brute force search vs. Annoy-indexed search.
```python
import time
import numpy as np
from annoy import AnnoyIndex

def brute_force_search(embeddings, query_embedding):
    start_time = time.time()
    distances = [np.linalg.norm(embedding - query_embedding) for embedding in embeddings]
    closest_index = np.argmin(distances)
    end_time = time.time()
    return closest_index, end_time - start_time

def annoy_search(annoy_index, query_embedding):
    start_time = time.time()
    nearest_neighbors = annoy_index.get_nns_by_vector(query_embedding, 2)
    end_time = time.time()
    return nearest_neighbors, end_time - start_time

# Generate a large set of embeddings
num_embeddings = 10000
embedding_dim = 3
embeddings = np.random.rand(num_embeddings, embedding_dim)

# Create an Annoy index. Note: 'angular' is not the same metric as the
# Euclidean distance used in brute_force_search, so the two searches can
# disagree on the winner; use 'euclidean' if the results must match.
annoy_index = AnnoyIndex(embedding_dim, 'angular')
for i in range(num_embeddings):
    annoy_index.add_item(i, embeddings[i])
annoy_index.build(10)  # build with 10 trees

query_embedding = np.random.rand(embedding_dim)

# Measure brute force search time
closest_index_brute_force, time_brute_force = brute_force_search(embeddings, query_embedding)
print(f"Brute Force Time: {time_brute_force:.4f} seconds")

# Measure Annoy search time
nearest_neighbors, time_annoy = annoy_search(annoy_index, query_embedding)
print(f"Annoy Time: {time_annoy:.4f} seconds")
```
You're likely to see a significant speedup with Annoy, especially as the number of embeddings increases.
Choosing the Right Index
The best indexing technique depends on your specific needs. Factors to consider include:
- Accuracy: How close do the approximate nearest neighbors need to be?
- Speed: How quickly do you need to perform searches?
- Memory Usage: How much memory can you afford to use?
- Build Time: How long does it take to build the index?
Common Mistakes and Debugging
- Incorrect Distance Metric: Make sure youโre using the correct distance metric (e.g., cosine similarity, Euclidean distance) for your application.
- Insufficient Number of Trees: Building the index with too few trees can lead to inaccurate results.
- Data Type Issues: Ensure that your embeddings are stored in the correct data type (e.g., float32).
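For example, converting embeddings from NumPy's default float64 to float32 halves their memory footprint, usually with no practical loss of precision for similarity search:

```python
import numpy as np

embeddings64 = np.random.rand(1000, 384)        # float64 by default
embeddings32 = embeddings64.astype(np.float32)  # half the memory

print(embeddings64.nbytes, embeddings32.nbytes)
```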
Actionable Takeaways
- Indexing is essential for efficient search in large datasets.
- HNSW is a popular choice for its balance of speed and accuracy.
- Libraries like Annoy simplify the process of building and using indexes.
- Experiment with different indexing techniques to find the best solution for your specific needs.
This concludes our exploration of indexing for faster search. In the next post, we'll tackle reranking!