In our previous post, we explored how to store embeddings. Now that we have a store, the next logical step is to actually search it. A naive approach is to compare every single embedding in the store to our query embedding; this is called a "brute force" search. As the number of embeddings grows, however, this becomes incredibly slow. Today, we're diving into indexing: the key to making search fast.
Why is Brute Force Search So Slow?
Imagine you have a physical library with millions of books. To find a specific book by brute force, you would check every single shelf, one by one. This is incredibly inefficient. Indexing is like creating a card catalog or a table of contents: it lets you quickly locate the information you need without checking everything.
What is an Index?
An index is a data structure that allows for fast lookup of data. In the context of our retrieval pipeline, an index is a way to organize our embeddings so we can quickly find the ones that are most similar to our query embedding. Think of it as a shortcut to the most relevant information.
How Does Indexing Work?
Indexing doesn't make searching faster by magic. It works by organizing the data in a way that allows for efficient filtering and comparison, so most candidates never need to be examined at all. There are many different indexing techniques, each with its own trade-offs in terms of speed, accuracy, and memory usage.
Common Indexing Techniques (Brief Overview – No Deep Dive)
- Tree-based Indexes (e.g., KD-trees, Ball-trees): These organize embeddings into a hierarchical tree structure. While conceptually simple, they often struggle with high-dimensional data (like embeddings).
- Hash-based Indexes (e.g., Locality Sensitive Hashing – LSH): These hash similar embeddings into the same "bucket." This is fast but can lead to collisions (dissimilar embeddings landing in the same bucket).
- Graph-based Indexes (e.g., Hierarchical Navigable Small World – HNSW): These create a graph where nodes represent embeddings and edges connect similar embeddings. This is a popular choice for its balance of speed and accuracy. (We're focusing on this one today.)
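To make the LSH idea above concrete, here is a minimal random-hyperplane sketch. This is illustrative only: the vectors, the random seed, and the 8-bit signature size are arbitrary choices, not a production recipe.

```python
import numpy as np

def lsh_signature(vec, hyperplanes):
    # Each random hyperplane contributes one bit: which side of it
    # does the vector fall on?
    return tuple(int(np.dot(h, vec) > 0) for h in hyperplanes)

rng = np.random.default_rng(42)
hyperplanes = rng.normal(size=(8, 3))  # 8 bits -> up to 256 buckets

a = np.array([0.1, 0.2, 0.3])
b = np.array([0.11, 0.21, 0.29])  # nearly parallel to a
c = np.array([-0.9, 0.1, -0.5])   # points in a very different direction

# Nearby vectors tend to share a signature (bucket); distant ones tend not to.
print(lsh_signature(a, hyperplanes))
print(lsh_signature(b, hyperplanes))
print(lsh_signature(c, hyperplanes))
```

At search time you only compare the query against embeddings in its own bucket, which is where the speedup (and the risk of missing a true neighbor) comes from.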
HNSW: A Closer Look
HNSW builds a multi-layered graph. Each layer is a simplified version of the layer below it. When searching, you start at the top layer and navigate down, quickly narrowing down the search space.
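That layered descent can be sketched with a toy example. This is not a real HNSW implementation (no insertion heuristics, no beam search); it only shows the greedy top-down navigation on hand-built layers, with made-up points and adjacency lists.

```python
import numpy as np

# Toy multi-layer graph: each layer maps node id -> list of neighbour ids.
# Upper layers contain fewer nodes (a coarse "zoom level" of the data).
points = {0: [0.1, 0.1], 1: [0.2, 0.9], 2: [0.9, 0.2], 3: [0.85, 0.9], 4: [0.5, 0.5]}
layers = [
    {0: [3], 3: [0]},                                   # top layer: sparse
    {0: [3, 4], 3: [0, 4], 4: [0, 3]},                  # middle layer
    {0: [1, 2, 4], 1: [0, 3, 4], 2: [0, 3, 4],
     3: [1, 2, 4], 4: [0, 1, 2, 3]},                    # bottom layer: all nodes
]

def dist(a, b):
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

def greedy_search(query, entry=0):
    current = entry
    for graph in layers:  # descend from the top layer to the bottom
        improved = True
        while improved:
            improved = False
            for neighbour in graph[current]:
                if dist(points[neighbour], query) < dist(points[current], query):
                    current = neighbour
                    improved = True
        # the best node found on this layer becomes the entry point below
    return current

print(greedy_search([0.8, 0.85]))  # node 3 is the closest point
```

Because each upper layer is sparse, the first few hops cover large distances cheaply; the dense bottom layer then refines the answer locally.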
Simple Example: Building a Basic Index (Conceptual)
Let's imagine we have just five embeddings. We're going to create a very simplified index (not a real HNSW implementation, just to illustrate the concept).
```python
# Conceptual example - NOT a working HNSW implementation
embeddings = [
    [0.1, 0.2, 0.3],
    [0.2, 0.3, 0.4],
    [0.7, 0.8, 0.9],
    [0.8, 0.9, 1.0],
    [0.4, 0.5, 0.6],
]

# Simplified "index" - just a list sorted by the first dimension
indexed_embeddings = sorted(embeddings, key=lambda x: x[0])

query_embedding = [0.3, 0.4, 0.5]

def closest_to(query, candidates):
    # In reality, a more sophisticated distance calculation and search
    # strategy would be used; here we simply scan every candidate using
    # squared Euclidean distance over all dimensions.
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(candidates, key=lambda c: sq_dist(query, c))

closest_embedding = closest_to(query_embedding, indexed_embeddings)
print(f"Closest embedding: {closest_embedding}")
```
This is a very simplified illustration. In a real-world scenario, the indexing process would involve more sophisticated algorithms and data structures.
A More Realistic Example: Using a Library (Annoy)
Annoy (Approximate Nearest Neighbors Oh Yeah) is a popular library for building approximate nearest neighbor search indexes. It's relatively easy to use and provides good performance.
First, install Annoy:
```shell
pip install annoy
```
Now, let's build an index and add some embeddings:
```python
from annoy import AnnoyIndex
import numpy as np

# Define the dimensionality of our embeddings
dimensions = 3

# Create an Annoy index ('angular' distance approximates cosine similarity)
annoy_index = AnnoyIndex(dimensions, 'angular')

# Add some embeddings to the index
for i in range(5):
    embedding = np.random.rand(dimensions)  # generate a random embedding
    annoy_index.add_item(i, embedding)

# Build the index (more trees = better accuracy, slower build time)
annoy_index.build(5)

# Search the index for the 2 nearest neighbors of a query embedding
query_embedding = np.random.rand(dimensions)
nearest_neighbors = annoy_index.get_nns_by_vector(query_embedding, 2)

print(f"Query embedding: {query_embedding}")
print(f"Nearest neighbors: {nearest_neighbors}")
```
This code generates random embeddings, adds them to an Annoy index, builds the index, and then searches for the nearest neighbors to a query embedding. The get_nns_by_vector function returns a list of item IDs, which can then be used to retrieve the actual embeddings from the store.
Comparing Brute Force vs. Indexed Search
Let's create a simple function to measure the performance of brute force search vs. Annoy-indexed search.
```python
import time
import numpy as np
from annoy import AnnoyIndex

def brute_force_search(embeddings, query_embedding):
    start_time = time.time()
    distances = [np.linalg.norm(embedding - query_embedding) for embedding in embeddings]
    closest_index = np.argmin(distances)
    end_time = time.time()
    return closest_index, end_time - start_time

def annoy_search(annoy_index, query_embedding):
    start_time = time.time()
    nearest_neighbors = annoy_index.get_nns_by_vector(query_embedding, 2)
    end_time = time.time()
    return nearest_neighbors, end_time - start_time

# Generate a large set of embeddings
num_embeddings = 10000
embedding_dim = 3
embeddings = np.random.rand(num_embeddings, embedding_dim)

# Create an Annoy index. Note: 'angular' is not the same metric as the
# Euclidean distance used in brute_force_search, so the two searches can
# disagree on the winner; use 'euclidean' if the results must match.
annoy_index = AnnoyIndex(embedding_dim, 'angular')
for i in range(num_embeddings):
    annoy_index.add_item(i, embeddings[i])
annoy_index.build(10)  # build with 10 trees

query_embedding = np.random.rand(embedding_dim)

# Measure brute force search time
closest_index_brute_force, time_brute_force = brute_force_search(embeddings, query_embedding)
print(f"Brute Force Time: {time_brute_force:.4f} seconds")

# Measure Annoy search time
nearest_neighbors, time_annoy = annoy_search(annoy_index, query_embedding)
print(f"Annoy Time: {time_annoy:.4f} seconds")
```
You're likely to see a significant speedup with Annoy, especially as the number of embeddings increases.
Choosing the Right Index
The best indexing technique depends on your specific needs. Factors to consider include:
- Accuracy: How close do the approximate nearest neighbors need to be?
- Speed: How quickly do you need to perform searches?
- Memory Usage: How much memory can you afford to use?
- Build Time: How long does it take to build the index?
Common Mistakes and Debugging
- Incorrect Distance Metric: Make sure youโre using the correct distance metric (e.g., cosine similarity, Euclidean distance) for your application.
- Insufficient Number of Trees: Building the index with too few trees can lead to inaccurate results.
- Data Type Issues: Ensure that your embeddings are stored in the correct data type (e.g., float32).
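For example, converting embeddings from NumPy's default float64 to float32 halves their memory footprint, usually with no practical loss of precision for similarity search:

```python
import numpy as np

embeddings64 = np.random.rand(1000, 384)        # float64 by default
embeddings32 = embeddings64.astype(np.float32)  # half the memory

print(embeddings64.nbytes, embeddings32.nbytes)
```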
Actionable Takeaways
- Indexing is essential for efficient search in large datasets.
- HNSW is a popular choice for its balance of speed and accuracy.
- Libraries like Annoy simplify the process of building and using indexes.
- Experiment with different indexing techniques to find the best solution for your specific needs.
This concludes our exploration of indexing for faster search. In the next post, we'll tackle reranking!