Search: Finding the Closest Matches

In our previous post, we covered how to create an index to organize our embeddings. Now that we have an index, we need to actually use it to find the closest matches to a given query. Today, we're diving into the simplest approach: brute force search. We'll see how it works and where its limitations lie. We're building on the previous concepts of input, embedding, storing, and indexing, but today's focus is purely on the search process itself.

Why Brute Force?

Think of it like searching for a specific book in a library. A brute force search would mean checking every single book on the shelves to see if it matches your query. It's guaranteed to find the right answer (if it exists), but it's also incredibly slow, especially if the library is huge.

In the context of embeddings, a brute force search involves calculating the distance between the query vector and every vector in our store. The vectors with the smallest distances are considered the closest matches.

Understanding Distance Metrics

Before we dive into the code, let's briefly touch on how we measure "distance" between vectors. Cosine similarity is a common choice, especially for text embeddings. It measures the angle between two vectors: a smaller angle (closer to 0 degrees) means higher similarity, so with this metric larger values indicate closer matches. Other metrics include Euclidean distance (straight-line distance, where smaller values mean closer) and the dot product. The choice of metric depends on the specific application and on how the embeddings were trained. We'll use cosine similarity for our examples.
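To make the difference between these metrics concrete, here is a small sketch comparing all three on two vectors that point in the same direction but differ in length (the vectors are made up for illustration):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the length

# Cosine similarity ignores magnitude: parallel vectors score ~1.0.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean distance is magnitude-sensitive: these points are far apart.
euclidean = np.linalg.norm(a - b)

# Dot product grows with both alignment and length.
dot = np.dot(a, b)

print(cosine)     # ~1.0
print(euclidean)  # ~3.74
print(dot)        # 28.0
```

Cosine similarity treats these vectors as essentially identical, while Euclidean distance treats them as quite different. This is why cosine similarity is popular for text embeddings, where direction tends to carry meaning and magnitude often does not.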

The Basics: A Simple Brute Force Search

Let's start with a very basic Python example using NumPy. We'll assume we have a list of embeddings stored in a NumPy array.

import numpy as np

# Sample embeddings (replace with your actual embeddings)
embeddings = np.array([
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [0.2, 0.3, 0.4]
])

# Query vector
query_vector = np.array([0.3, 0.4, 0.5])

# Function to calculate cosine similarity
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Brute force search
similarities = []
for embedding in embeddings:
    similarity = cosine_similarity(embedding, query_vector)
    similarities.append(similarity)

# Get indices of top 2 most similar embeddings
top_indices = np.argsort(similarities)[-2:][::-1]  # Get the indices of the 2 largest values

print("Similarities:", similarities)
print("Top 2 indices:", top_indices)

Explanation:

  1. Import NumPy: We use NumPy for efficient array operations.
  2. Sample Embeddings: This is placeholder data. Replace it with your actual embeddings.
  3. Query Vector: This is the vector we're searching for similar embeddings to.
  4. cosine_similarity Function: Calculates the cosine similarity between two vectors.
  5. Brute Force Search: The for loop iterates through each embedding in the embeddings array. For each embedding, it calculates the cosine similarity with the query_vector and stores the result in the similarities list.
  6. np.argsort: This function returns the indices that would sort the similarities array. [-2:] selects the last two elements (the two highest similarities). [::-1] reverses the order, so we get the indices in descending order of similarity.
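If np.argsort is new to you, this tiny example shows the indexing trick from step 6 in isolation (the similarity scores here are made up):

```python
import numpy as np

sims = np.array([0.2, 0.9, 0.5, 0.7])

# argsort returns indices that would sort the array ascending.
order = np.argsort(sims)   # array([0, 2, 3, 1])

# Take the last two (largest values), then reverse so the best comes first.
top2 = order[-2:][::-1]    # array([1, 3])

print(order, top2)
```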

This code provides a fundamental understanding of the brute force approach. However, as the number of embeddings grows, this becomes increasingly slow.

A More Realistic Example: Using a Distance Function

To improve readability and reusability, let's encapsulate the distance calculation into a separate function.

import numpy as np

# Sample embeddings (replace with your actual embeddings)
embeddings = np.array([
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [0.2, 0.3, 0.4]
])

# Query vector
query_vector = np.array([0.3, 0.4, 0.5])


def calculate_distance(vector1, vector2):
    """Calculates cosine similarity between two vectors.

    Note: despite the name, this returns a similarity score,
    so larger values mean the vectors are closer.
    """
    return np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))

# Brute force search
distances = [calculate_distance(embedding, query_vector) for embedding in embeddings]

# Get indices of top 2 most similar embeddings
top_indices = np.argsort(distances)[-2:][::-1]

print("Distances:", distances)
print("Top 2 indices:", top_indices)

Explanation:

  1. calculate_distance Function: This function encapsulates the cosine similarity calculation, making the code cleaner and easier to understand.
  2. List Comprehension: We use a list comprehension [calculate_distance(embedding, query_vector) for embedding in embeddings] to efficiently calculate the distances for all embeddings.
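One more refinement worth knowing about: NumPy can compute all the similarities in a single matrix-vector product, removing the Python-level loop entirely. This is a sketch of the same search, vectorized, using the same sample data as above:

```python
import numpy as np

embeddings = np.array([
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [0.2, 0.3, 0.4]
])
query_vector = np.array([0.3, 0.4, 0.5])

# One matrix-vector product gives every dot product at once;
# dividing by the norms turns them into cosine similarities.
row_norms = np.linalg.norm(embeddings, axis=1)
similarities = embeddings @ query_vector / (row_norms * np.linalg.norm(query_vector))

top_indices = np.argsort(similarities)[-2:][::-1]
print("Top 2 indices:", top_indices)
```

The result matches the loop version exactly. The work is still O(N), but it runs in optimized C inside NumPy rather than in interpreted Python, which matters once N gets large.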

The Problem with Brute Force: Scalability

The primary limitation of brute force search is its scalability. The time complexity is O(N), where N is the number of embeddings. This means that as the number of embeddings doubles, the search time also roughly doubles. For very large datasets (millions or billions of embeddings), brute force search becomes impractical.

Imagine a library with millions of books. Checking every single book to find a specific title would take an incredibly long time. Thatโ€™s the problem brute force search faces.
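You can observe this linear growth directly by timing the same search at two dataset sizes. The sketch below uses random vectors and Python's time.perf_counter; the absolute times depend on your machine, but the larger run should take roughly ten times longer than the smaller one:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
query = rng.standard_normal(128)

def brute_force_search(embeddings, query, k=5):
    """Top-k indices by cosine similarity, computed against every row."""
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query)
    sims = embeddings @ query / norms
    return np.argsort(sims)[-k:][::-1]

for n in (10_000, 100_000):
    data = rng.standard_normal((n, 128))
    start = time.perf_counter()
    top = brute_force_search(data, query)
    elapsed = time.perf_counter() - start
    print(f"N={n:>7,}: {elapsed:.4f}s")
```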

Visualizing the Bottleneck

(Image: A graph showing the relationship between the number of embeddings and the search time for a brute force search. The graph shows a linear relationship, demonstrating the O(N) complexity.)

The graph visually demonstrates the linear relationship between the number of embeddings and the search time. As the number of embeddings increases, the search time increases proportionally.

Conclusion: When is Brute Force Acceptable?

While brute force search is inefficient for large datasets, it can be acceptable in certain situations:

  • Small Datasets: If the number of embeddings is relatively small (e.g., less than 1000), the performance overhead might be negligible.
  • Baseline Comparison: It provides a simple baseline to compare against more sophisticated search algorithms.
  • Initial Prototyping: It's a quick and easy way to get started with a search system before optimizing for performance.

However, for any real-world application with a large number of embeddings, more efficient search algorithms like HNSW (which we touched on in the previous post) are essential. These algorithms trade a small amount of accuracy for dramatically faster search times, a trade-off that is well worth it for large-scale applications.

