In our previous post, we covered how to create an index to organize our embeddings. Now that we have an index, we need to actually use it to find the closest matches to a given query. Today, we're diving into the simplest approach: brute force search. We're going to see how it works, why it's not always the best solution, and what its limitations are. We're building on the previous concepts of input, embedding, storing, and indexing, but today's focus is purely on the search process itself.
Why Brute Force?
Think of it like searching for a specific book in a library. A brute force search would mean checking every single book on the shelves to see if it matches your query. It's guaranteed to find the right answer (if it exists), but it's also incredibly slow, especially if the library is huge.
In the context of embeddings, a brute force search involves calculating the distance between the query vector and every vector in our store. The vectors with the smallest distances are considered the closest matches.
Understanding Distance Metrics
Before we dive into the code, let's briefly touch on how we measure "distance" between vectors. Cosine similarity is a common choice, especially when dealing with text embeddings. It measures the cosine of the angle between two vectors; a smaller angle (closer to 0 degrees) means a cosine closer to 1 and thus higher similarity. Other metrics include Euclidean distance (straight-line distance) and the dot product. The choice of metric depends on the specific application and the nature of the embeddings. We'll use cosine similarity for our examples.
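To make the differences concrete, here is a small sketch comparing the three metrics on a pair of toy vectors (the values are made up purely for illustration). Notice that scaling a vector changes its Euclidean distance and dot product, but not its cosine similarity:

```python
import numpy as np

# Two toy vectors (hypothetical values, just for illustration)
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the length

# Cosine similarity: depends only on the angle, so scaling doesn't change it
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean distance: straight-line distance, sensitive to magnitude
euclidean = np.linalg.norm(a - b)

# Dot product: grows with both alignment and magnitude
dot = np.dot(a, b)

print(f"cosine:    {cosine:.4f}")    # 1.0 -- identical direction
print(f"euclidean: {euclidean:.4f}")
print(f"dot:       {dot:.4f}")
```

Because `b` points in exactly the same direction as `a`, the cosine similarity is 1.0 even though the two vectors are clearly "far apart" by the other two measures.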
The Basics: A Simple Brute Force Search
Let's start with a very basic Python example using NumPy. We'll assume we have a list of embeddings stored in a NumPy array.
```python
import numpy as np

# Sample embeddings (replace with your actual embeddings)
embeddings = np.array([
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [0.2, 0.3, 0.4]
])

# Query vector
query_vector = np.array([0.3, 0.4, 0.5])

# Function to calculate cosine similarity
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Brute force search
similarities = []
for embedding in embeddings:
    similarity = cosine_similarity(embedding, query_vector)
    similarities.append(similarity)

# Get indices of the top 2 most similar embeddings
top_indices = np.argsort(similarities)[-2:][::-1]  # indices of the 2 largest values

print("Similarities:", similarities)
print("Top 2 indices:", top_indices)
```
Explanation:
- Import NumPy: We use NumPy for efficient array operations.
- Sample Embeddings: This is placeholder data. Replace it with your actual embeddings.
- Query Vector: This is the vector weโre searching for similar embeddings to.
- `cosine_similarity` Function: Calculates the cosine similarity between two vectors.
- Brute Force Search: The `for` loop iterates through each embedding in the `embeddings` array. For each embedding, it calculates the cosine similarity with the `query_vector` and stores the result in the `similarities` list.
- `np.argsort`: This function returns the indices that would sort the `similarities` array. `[-2:]` selects the last two elements (the two highest similarities), and `[::-1]` reverses the order, so we get the indices in descending order of similarity.
This code provides a fundamental understanding of the brute force approach. However, as the number of embeddings grows, this becomes increasingly slow.
A More Realistic Example: Using a Distance Function
To improve readability and reusability, let's encapsulate the distance calculation into a separate function.
```python
import numpy as np

# Sample embeddings (replace with your actual embeddings)
embeddings = np.array([
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [0.2, 0.3, 0.4]
])

# Query vector
query_vector = np.array([0.3, 0.4, 0.5])

def calculate_distance(vector1, vector2):
    """Calculates cosine similarity between two vectors.

    Note: despite the name, this returns a similarity score,
    where larger means closer, not a true distance.
    """
    return np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))

# Brute force search
distances = [calculate_distance(embedding, query_vector) for embedding in embeddings]

# Get indices of the top 2 most similar embeddings
top_indices = np.argsort(distances)[-2:][::-1]

print("Distances:", distances)
print("Top 2 indices:", top_indices)
```
Explanation:
- `calculate_distance` Function: This function encapsulates the cosine similarity calculation, making the code cleaner and easier to understand. (Despite the name, it returns a similarity score, where larger means closer.)
- List Comprehension: We use a list comprehension, `[calculate_distance(embedding, query_vector) for embedding in embeddings]`, to calculate the scores for all embeddings in one expression.
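As an aside, the per-row loop (or list comprehension) can be collapsed into a single NumPy matrix operation: normalize the embeddings and the query once, and one matrix-vector product then yields every cosine similarity at once. This is still a brute force scan, just a vectorized one. A sketch using the same sample data:

```python
import numpy as np

embeddings = np.array([
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [0.2, 0.3, 0.4]
])
query_vector = np.array([0.3, 0.4, 0.5])

# Normalize every embedding row and the query to unit length once...
norm_embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
norm_query = query_vector / np.linalg.norm(query_vector)

# ...then one matrix-vector product gives all cosine similarities at once
similarities = norm_embeddings @ norm_query

top_indices = np.argsort(similarities)[-2:][::-1]
print("Similarities:", similarities)
print("Top 2 indices:", top_indices)  # same ranking as the loop version
```

The results match the loop version exactly; NumPy simply pushes the iteration down into optimized native code, which matters once N gets large.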
The Problem with Brute Force: Scalability
The primary limitation of brute force search is its scalability. Each query takes O(N) time, where N is the number of embeddings (more precisely O(N·d), with d the vector dimension). Double the number of embeddings and the search time roughly doubles. For very large datasets (millions or billions of embeddings), brute force search becomes impractical.
Imagine a library with millions of books. Checking every single book to find a specific title would take an incredibly long time. Thatโs the problem brute force search faces.
Visualizing the Bottleneck
(Image: A graph showing the relationship between the number of embeddings and the search time for a brute force search. The graph shows a linear relationship, demonstrating the O(N) complexity.)
The graph visually demonstrates the linear relationship between the number of embeddings and the search time. As the number of embeddings increases, the search time increases proportionally.
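A rough way to observe this yourself is to time the scan at a few dataset sizes. The sketch below uses random unit vectors (the sizes and the dimension of 128 are chosen arbitrarily for illustration); the absolute timings will vary by machine, but the growth should be roughly linear in N:

```python
import time
import numpy as np

rng = np.random.default_rng(0)

# Random unit-length query vector (dimension chosen arbitrarily)
query = rng.normal(size=128)
query /= np.linalg.norm(query)

for n in (10_000, 20_000, 40_000):
    # Random unit-length embeddings
    data = rng.normal(size=(n, 128))
    data /= np.linalg.norm(data, axis=1, keepdims=True)

    # Time one brute force scan: every similarity, then the best match
    start = time.perf_counter()
    sims = data @ query
    best = int(np.argmax(sims))
    elapsed = time.perf_counter() - start

    print(f"N={n:>6}: {elapsed * 1000:.2f} ms (best index {best})")
```

Each doubling of N should roughly double the elapsed time, which is exactly the linear relationship the graph describes.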
Conclusion: When is Brute Force Acceptable?
While brute force search is inefficient for large datasets, it can be acceptable in certain situations:
- Small Datasets: If the number of embeddings is relatively small (e.g., less than 1000), the performance overhead might be negligible.
- Baseline Comparison: It provides a simple baseline to compare against more sophisticated search algorithms.
- Initial Prototyping: It's a quick and easy way to get started with a search system before optimizing for performance.
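For those small-dataset and prototyping cases, you also don't have to write the loop yourself; libraries can handle the exhaustive scan. For example, scikit-learn's `NearestNeighbors` supports one via `algorithm='brute'` (this sketch assumes scikit-learn is installed; note that with `metric='cosine'` it reports cosine *distance*, i.e. 1 minus cosine similarity, so smaller is better):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors  # assumes scikit-learn is installed

embeddings = np.array([
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [0.2, 0.3, 0.4]
])
query_vector = np.array([[0.3, 0.4, 0.5]])  # kneighbors expects a 2D array

# algorithm='brute' forces an exhaustive scan over all embeddings
nn = NearestNeighbors(n_neighbors=2, algorithm='brute', metric='cosine')
nn.fit(embeddings)

distances, indices = nn.kneighbors(query_vector)
print("Top 2 indices:", indices[0])       # same ranking as the manual loop
print("Cosine distances:", distances[0])  # 1 - cosine similarity
```

This gives the same ranking as the hand-rolled versions above while keeping the door open to swapping in a different `algorithm` later.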
However, for any real-world application with a large number of embeddings, more efficient search algorithms like HNSW (which we touched on briefly in a previous post) are essential. These algorithms sacrifice some accuracy to achieve significantly faster search times. The trade-off is worth it for large-scale applications.