Index: Making Search Faster

In our previous post, we explored how to store embeddings. Now that we have a store, the next logical step is to actually search it. A naive approach would be to compare every single embedding in the store to our query embedding. This is what we call a “brute force” search. However, as the number of embeddings grows, this becomes prohibitively slow. Today, we’re diving into indexing – the key to making search fast.

Why is Brute Force Search So Slow?

Imagine you have a physical library with millions of books. To find a specific book using brute force, you’re essentially checking every single shelf, one by one. This is incredibly inefficient. Indexing is like creating a card catalog or a table of contents – it allows you to quickly locate the information you need without checking everything.

What is an Index?

An index is a data structure that allows for fast lookup of data. In the context of our retrieval pipeline, an index is a way to organize our embeddings so we can quickly find the ones that are most similar to our query embedding. Think of it as a shortcut to the most relevant information.

How Does Indexing Work?

Indexing doesn’t magically make searching faster. It’s about organizing the data in a way that allows for efficient filtering and comparison. There are many different indexing techniques, each with its own trade-offs in terms of speed, accuracy, and memory usage.

Common Indexing Techniques (Brief Overview – No Deep Dive)

  • Tree-based Indexes (e.g., KD-trees, Ball-trees): These organize embeddings into a hierarchical tree structure. While conceptually simple, they often struggle with high-dimensional data (like embeddings).
  • Hash-based Indexes (e.g., Locality Sensitive Hashing – LSH): These hash similar embeddings into the same “bucket.” This is fast but can lead to collisions (dissimilar embeddings landing in the same bucket).
  • Graph-based Indexes (e.g., Hierarchical Navigable Small World – HNSW): These create a graph where nodes represent embeddings and edges connect similar embeddings. This is a popular choice for its balance of speed and accuracy, and we’ll take a closer look at it below.
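To make the hash-based idea concrete, here is a minimal sketch of random-hyperplane LSH. This is an illustrative assumption, not code from any particular library: the sign of a vector’s dot product with each random hyperplane contributes one bit of the hash, so vectors pointing in similar directions tend to land in the same bucket.

```python
import numpy as np

def lsh_bucket(vector, hyperplanes):
    # One hash bit per hyperplane: which side of the plane the vector falls on.
    bits = (hyperplanes @ vector) > 0
    return tuple(bits.tolist())

rng = np.random.default_rng(0)
num_planes, dimensions = 4, 3

# Random hyperplanes through the origin define the hash function.
hyperplanes = rng.standard_normal((num_planes, dimensions))

a = np.array([0.1, 0.2, 0.3])
bucket = lsh_bucket(a, hyperplanes)
print(bucket)
```

With more hyperplanes, buckets get smaller and fewer dissimilar vectors collide, at the cost of sometimes splitting genuinely similar vectors across buckets.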

HNSW: A Closer Look

HNSW builds a multi-layered graph. Each layer above the bottom contains a progressively sparser subset of the embeddings. When searching, you start at the sparse top layer and navigate down, quickly narrowing the search space at each level.
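The coarse-to-fine descent can be sketched in a few lines. This is not HNSW itself – just an assumed two-layer toy: a sparse “upper layer” gives a cheap entry point, and the fine search then only considers points near that entry (a stand-in for following graph edges).

```python
import numpy as np

rng = np.random.default_rng(42)
points = rng.random((100, 3))  # the "bottom layer": all embeddings

# A sparse "upper layer": a small random sample of the points.
upper_ids = rng.choice(len(points), size=10, replace=False)

def nearest(candidate_ids, query):
    # Return the candidate ID whose point is closest to the query.
    dists = np.linalg.norm(points[candidate_ids] - query, axis=1)
    return int(candidate_ids[np.argmin(dists)])

query = rng.random(3)

# Step 1: coarse search among the few upper-layer points.
entry = nearest(upper_ids, query)

# Step 2: refine only among the 20 points closest to the entry point,
# instead of scanning all 100.
all_ids = np.arange(len(points))
near_entry = all_ids[np.argsort(np.linalg.norm(points - points[entry], axis=1))[:20]]
result = nearest(near_entry, query)
```

The refinement step can only improve on the entry point, because the entry point is itself one of the candidates it considers.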

Simple Example: Building a Basic Index (Conceptual)

Let’s imagine we have just five embeddings. We’re going to create a very simplified index (not a real HNSW implementation, just to illustrate the concept).

# Conceptual Example - NOT a working HNSW implementation
import math

embeddings = [
    [0.1, 0.2, 0.3],
    [0.2, 0.3, 0.4],
    [0.7, 0.8, 0.9],
    [0.8, 0.9, 1.0],
    [0.4, 0.5, 0.6]
]

# Simplified "index" - just a sorted list based on the first dimension
indexed_embeddings = sorted(embeddings, key=lambda x: x[0])

query_embedding = [0.3, 0.4, 0.5]

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# "Search" the index
# A real index would use its structure to skip most comparisons;
# here we simply scan for the closest embedding by Euclidean distance.
closest_embedding = min(indexed_embeddings,
                        key=lambda e: euclidean_distance(e, query_embedding))

print(f"Closest embedding: {closest_embedding}")

This is a very simplified illustration. In a real-world scenario, the indexing process would involve more sophisticated algorithms and data structures.

A More Realistic Example: Using a Library (Annoy)

Annoy (Approximate Nearest Neighbors Oh Yeah) is a popular library for building approximate nearest neighbor indexes. It’s relatively easy to use and provides good performance. (Note that Annoy is tree-based, using a forest of random-projection trees, rather than an HNSW implementation – but the workflow below is similar across most ANN libraries.)

First, install Annoy:

pip install annoy

Now, let’s build an index and add some embeddings:

from annoy import AnnoyIndex
import numpy as np

# Define the dimensions of our embeddings
dimensions = 3

# Create an Annoy index
annoy_index = AnnoyIndex(dimensions, 'angular')  # Use 'angular' for cosine similarity

# Add some embeddings to the index
for i in range(5):
    embedding = np.random.rand(dimensions)  # Generate a random embedding
    annoy_index.add_item(i, embedding)

# Build the index
annoy_index.build(5)  # Number of trees (more trees = better accuracy, slower build time)

# Search the index
query_embedding = np.random.rand(dimensions)
nearest_neighbors = annoy_index.get_nns_by_vector(query_embedding, 2) # Find the 2 nearest neighbors

print(f"Query embedding: {query_embedding}")
print(f"Nearest neighbors: {nearest_neighbors}")

This code generates random embeddings, adds them to an Annoy index, builds the index, and then searches for the nearest neighbors to a query embedding. The get_nns_by_vector function returns a list of item IDs, which can then be used to retrieve the actual embeddings from the store.

Comparing Brute Force vs. Indexed Search

Let’s create a simple function to measure the performance of brute force search vs. Annoy-indexed search.

import time
import numpy as np
from annoy import AnnoyIndex

def brute_force_search(embeddings, query_embedding):
    start_time = time.time()
    distances = [np.linalg.norm(embedding - query_embedding) for embedding in embeddings]
    closest_index = np.argmin(distances)
    end_time = time.time()
    return closest_index, end_time - start_time

def annoy_search(annoy_index, query_embedding):
    start_time = time.time()
    nearest_neighbors = annoy_index.get_nns_by_vector(query_embedding, 2)
    end_time = time.time()
    return nearest_neighbors, end_time - start_time


# Generate a large set of embeddings
num_embeddings = 10000
embedding_dim = 3
embeddings = np.random.rand(num_embeddings, embedding_dim)

# Create an Annoy index
annoy_index = AnnoyIndex(embedding_dim, 'angular')
for i in range(num_embeddings):
    annoy_index.add_item(i, embeddings[i])
annoy_index.build(10) # Build with 10 trees


query_embedding = np.random.rand(embedding_dim)


# Measure brute force search time
closest_index_brute_force, time_brute_force = brute_force_search(embeddings, query_embedding)
print(f"Brute Force Time: {time_brute_force:.4f} seconds")

# Measure Annoy search time
nearest_neighbors, time_annoy = annoy_search(annoy_index, query_embedding)
print(f"Annoy Time: {time_annoy:.4f} seconds")

You’re likely to see a significant speedup with Annoy, especially as the number of embeddings increases. (One caveat: the brute force function uses Euclidean distance while the Annoy index uses the angular metric, so compare the timings rather than the returned neighbors.)

Choosing the Right Index

The best indexing technique depends on your specific needs. Factors to consider include:

  • Accuracy: How close do the approximate nearest neighbors need to be?
  • Speed: How quickly do you need to perform searches?
  • Memory Usage: How much memory can you afford to use?
  • Build Time: How long does it take to build the index?

Common Mistakes and Debugging

  • Incorrect Distance Metric: Make sure you’re using the correct distance metric (e.g., cosine similarity, Euclidean distance) for your application.
  • Insufficient Number of Trees: Building the index with too few trees can lead to inaccurate results.
  • Data Type Issues: Ensure that your embeddings are stored in the correct data type (e.g., float32).
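The first of these mistakes is worth seeing in numbers: Euclidean distance and cosine distance can disagree about which vector is “closest,” because cosine only cares about direction. A tiny NumPy illustration:

```python
import numpy as np

def euclidean(a, b):
    return np.linalg.norm(a - b)

def cosine_distance(a, b):
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query = np.array([1.0, 1.0, 0.0])
close = np.array([0.9, 1.1, 0.0])     # similar magnitude and direction
scaled = np.array([10.0, 10.0, 0.0])  # same direction, 10x the magnitude

# Euclidean treats the scaled vector as far away...
print(euclidean(query, close), euclidean(query, scaled))
# ...while cosine distance sees it as pointing the same way.
print(cosine_distance(query, close), cosine_distance(query, scaled))
```

If your embedding model was trained with cosine similarity in mind (many are), an index built on raw Euclidean distance can quietly return worse neighbors.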

Actionable Takeaways

  • Indexing is essential for efficient search in large datasets.
  • HNSW is a popular choice for its balance of speed and accuracy.
  • Libraries like Annoy simplify the process of building and using indexes.
  • Experiment with different indexing techniques to find the best solution for your specific needs.

This concludes our exploration of indexing for faster search. In the next post, we’ll dive into reranking!

