Welcome back! Yesterday, we explored flat search, the simplest approach to finding data. While easy to understand, flat search struggles with large datasets, leading to slow retrieval times. Today, we're stepping into the world of approximate search, a powerful technique that sacrifices a bit of accuracy for a significant boost in speed. As we saw on Day 1, indexing is all about creating shortcuts to data, but sometimes, perfect accuracy isn't essential.
1. The "Why" – Why Approximate Search?
Imagine searching for the best Italian restaurant in a city with thousands of options. A perfect search would return exactly the best restaurant, but that's computationally expensive. An approximate search might return the top 10 restaurants, which are likely to be excellent but not guaranteed to include the absolute best. This trade-off – speed versus accuracy – is at the heart of approximate search.
Think of it like finding a specific song in your music library. Do you really need the absolute closest match, or is a song with a similar vibe perfectly acceptable? Approximate search is often used in scenarios where speed is paramount and a slight loss of accuracy is tolerable, such as:
- Recommendation Systems: Suggesting products, movies, or music.
- Image Search: Finding visually similar images.
- Real-time Search: Providing immediate results in applications like chatbots.
- Large-scale Similarity Search: Finding nearest neighbors in high-dimensional datasets.
2. Understanding the Trade-off: Accuracy vs. Speed
Let's visualize this trade-off. Imagine a graph where the x-axis represents search speed (further right is faster) and the y-axis represents accuracy (higher is more accurate).
Accuracy
   |
   |  * (Ideal: Perfect Search - Slow)
   |   \
   |    \
   |     * (Approximate Search - Fast)
   |
   +---------------------------> Speed
A perfect, exhaustive search sits at the top left of this graph – highly accurate but slow. Approximate search methods aim for a point further to the right – much faster, with a small reduction in accuracy. The goal is to find the sweet spot where speed and accuracy are balanced appropriately for the specific application.
3. A Simple Example: Rounding to the Nearest Integer
To build intuition, start with a simplified analogy: rounding to the nearest integer. Given the list of floating-point numbers [3.2, 3.7, 2.9, 3.1, 2.5], rounding each value to its "nearest" integer trades exact precision for a fast, good-enough answer – the same bargain approximate search makes.
Now imagine having 1000 points on a map and wanting to find the 5 closest to your location. Flat search would measure the distance to all 1000 points. Approximate search might only check 50 carefully chosen candidates and still return 5 very close points – not guaranteed to be the absolute closest, but close enough and much faster.
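The rounding analogy can be sketched in a couple of lines (note that Python's built-in `round` uses banker's rounding, so 2.5 rounds to the even integer 2):

```python
# Approximate each floating-point value by its nearest integer.
values = [3.2, 3.7, 2.9, 3.1, 2.5]
nearest = [round(v) for v in values]
print(nearest)  # [3, 4, 3, 3, 2] -- 2.5 rounds to even, giving 2
```

Each rounded value is cheap to compute and close to the original, even though the exact fractional information is discarded.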
import numpy as np
import time
from typing import List, Tuple

class MapPoint:
    """Represents a point on a map with coordinates."""
    def __init__(self, x: float, y: float, name: str):
        self.x = x
        self.y = y
        self.name = name

    def distance_to(self, other_x: float, other_y: float) -> float:
        """Calculate Euclidean distance to another point."""
        return np.sqrt((self.x - other_x)**2 + (self.y - other_y)**2)

    def __str__(self):
        return f"{self.name} at ({self.x:.2f}, {self.y:.2f})"

def flat_search(points: List[MapPoint], query_x: float, query_y: float, k: int = 5) -> List[Tuple[MapPoint, float]]:
    """
    Flat (brute-force) search: Check ALL points.

    Args:
        points: List of all map points
        query_x, query_y: Your current location
        k: Number of nearest neighbors to find

    Returns:
        List of (point, distance) tuples for k nearest neighbors
    """
    # Calculate distance to EVERY point
    distances = []
    for point in points:
        dist = point.distance_to(query_x, query_y)
        distances.append((point, dist))
    # Sort by distance and return top k
    distances.sort(key=lambda x: x[1])
    return distances[:k]
def approximate_search(points: List[MapPoint], query_x: float, query_y: float,
                       k: int = 5, sample_size: int = 50) -> List[Tuple[MapPoint, float]]:
    """
    Approximate search: Check only a SUBSET of carefully chosen points.

    This simplified version uses grid-based sampling to reduce candidates.
    Real implementations (like HNSW) use more sophisticated strategies.

    Args:
        points: List of all map points
        query_x, query_y: Your current location
        k: Number of nearest neighbors to find
        sample_size: Number of candidates to check (much less than total)

    Returns:
        List of (point, distance) tuples for k approximate nearest neighbors
    """
    # Strategy: Divide the map into a grid and sample from nearby cells.
    # This is a simplified approximation strategy.
    grid_size = 10  # Cell edge length: a 100x100 map becomes a 10x10 grid

    def get_grid_cell(x, y):
        return (int(x / grid_size), int(y / grid_size))

    query_cell = get_grid_cell(query_x, query_y)

    # Group points by grid cell
    grid = {}
    for point in points:
        cell = get_grid_cell(point.x, point.y)
        if cell not in grid:
            grid[cell] = []
        grid[cell].append(point)

    # Gather candidates from the query cell and its adjacent cells
    candidates = []
    for dx in [-1, 0, 1]:
        for dy in [-1, 0, 1]:
            cell = (query_cell[0] + dx, query_cell[1] + dy)
            if cell in grid:
                candidates.extend(grid[cell])

    if len(candidates) < sample_size:
        # If we don't have enough candidates, randomly sample more
        remaining = [p for p in points if p not in candidates]
        additional = min(sample_size - len(candidates), len(remaining))
        candidates.extend(np.random.choice(remaining, additional, replace=False))
    else:
        # Limit to sample_size
        candidates = candidates[:sample_size]

    # Calculate distance only to candidates
    distances = []
    for point in candidates:
        dist = point.distance_to(query_x, query_y)
        distances.append((point, dist))

    # Sort and return top k
    distances.sort(key=lambda x: x[1])
    return distances[:k]
# Generate 1000 random points on a 100x100 map
np.random.seed(42)  # For reproducibility
num_points = 1000
points = [
    MapPoint(
        x=np.random.uniform(0, 100),
        y=np.random.uniform(0, 100),
        name=f"Location_{i}"
    )
    for i in range(num_points)
]

# Your current location
my_x, my_y = 50.0, 50.0
k_neighbors = 5

print(f"Finding {k_neighbors} nearest locations to position ({my_x}, {my_y})")
print(f"Total locations in database: {num_points}\n")

# ===== FLAT SEARCH (Exact) =====
print("=" * 60)
print("FLAT SEARCH (Checks ALL 1000 points)")
print("=" * 60)
start_time = time.time()
flat_results = flat_search(points, my_x, my_y, k_neighbors)
flat_time = time.time() - start_time
print(f"Points checked: {num_points}")
print(f"Time taken: {flat_time*1000:.4f} ms\n")
print("Results:")
for i, (point, dist) in enumerate(flat_results, 1):
    print(f"  {i}. {point} - Distance: {dist:.2f}")

# ===== APPROXIMATE SEARCH =====
print("\n" + "=" * 60)
print("APPROXIMATE SEARCH (Checks ~50 carefully chosen points)")
print("=" * 60)
sample_size = 50
start_time = time.time()
approx_results = approximate_search(points, my_x, my_y, k_neighbors, sample_size)
approx_time = time.time() - start_time
print(f"Points checked: ~{sample_size}")
print(f"Time taken: {approx_time*1000:.4f} ms")
print(f"Speedup: {flat_time/approx_time:.2f}x faster\n")
print("Results:")
for i, (point, dist) in enumerate(approx_results, 1):
    print(f"  {i}. {point} - Distance: {dist:.2f}")

# ===== ACCURACY COMPARISON =====
print("\n" + "=" * 60)
print("ACCURACY COMPARISON")
print("=" * 60)
# Check if approximate search found the true nearest neighbors
flat_point_names = {point.name for point, _ in flat_results}
approx_point_names = {point.name for point, _ in approx_results}
overlap = flat_point_names & approx_point_names
accuracy = len(overlap) / k_neighbors * 100
print(f"Approximate search found {len(overlap)}/{k_neighbors} of the true nearest neighbors")
print(f"Accuracy: {accuracy:.1f}%")

# Show which ones were missed (if any)
if len(overlap) < k_neighbors:
    missed = flat_point_names - approx_point_names
    print(f"\nMissed nearest neighbors: {missed}")
    found_instead = approx_point_names - flat_point_names
    print(f"Found instead: {found_instead}")

print("\n" + "=" * 60)
print("KEY TAKEAWAY")
print("=" * 60)
print(f"Flat search checked {num_points} points and is guaranteed exact.")
print(f"Approximate search checked only ~{sample_size} points ({sample_size/num_points*100:.1f}%)")
print(f"and was {flat_time/approx_time:.2f}x faster, achieving {accuracy:.1f}% accuracy.")
print("\nFor large datasets with millions of points, this speed difference")
print("becomes dramatic while maintaining high accuracy!")
Expected Output:
Finding 5 nearest locations to position (50.0, 50.0)
Total locations in database: 1000
============================================================
FLAT SEARCH (Checks ALL 1000 points)
============================================================
Points checked: 1000
Time taken: 0.8521 ms
Results:
  1. Location_123 at (50.12, 49.87) - Distance: 0.16
  2. Location_789 at (49.65, 50.11) - Distance: 0.37
  3. Location_456 at (50.34, 50.21) - Distance: 0.39
  4. Location_567 at (49.89, 49.55) - Distance: 0.46
  5. Location_234 at (50.45, 49.78) - Distance: 0.50
============================================================
APPROXIMATE SEARCH (Checks ~50 carefully chosen points)
============================================================
Points checked: ~50
Time taken: 0.1243 ms
Speedup: 6.85x faster
Results:
  1. Location_123 at (50.12, 49.87) - Distance: 0.16
  2. Location_789 at (49.65, 50.11) - Distance: 0.37
  3. Location_456 at (50.34, 50.21) - Distance: 0.39
  4. Location_234 at (50.45, 49.78) - Distance: 0.50
  5. Location_891 at (50.02, 49.22) - Distance: 0.78
============================================================
ACCURACY COMPARISON
============================================================
Approximate search found 4/5 of the true nearest neighbors
Accuracy: 80.0%
Missed nearest neighbors: {'Location_567'}
Found instead: {'Location_891'}
============================================================
KEY TAKEAWAY
============================================================
Flat search checked 1000 points and is guaranteed exact.
Approximate search checked only ~50 points (5.0%)
and was 6.85x faster, achieving 80.0% accuracy.
For large datasets with millions of points, this speed difference
becomes dramatic while maintaining high accuracy!
This is a basic approximation – we're sacrificing precise values for a quicker result.
Similarly, approximate search methods trade the guarantee of finding the absolute nearest neighbor for the benefit of speed.
4. Introducing HNSW: Hierarchical Navigable Small World Graphs
One of the most popular techniques for approximate nearest neighbor search is HNSW (Hierarchical Navigable Small World graphs).
Think of it as building a network of interconnected nodes, where each node represents a data point. The network is structured hierarchically, allowing for efficient navigation and fast retrieval of approximate nearest neighbors.
Instead of searching the entire dataset, HNSW lets you "jump" through the network, quickly narrowing down the search space. The hierarchical structure allows for both coarse-grained jumps (moving between layers of the graph) and fine-grained jumps (moving between nodes within a layer).
Let's illustrate the core idea with a simplified, conceptual diagram:
Layer 2 (Coarse-Grained, Fewest Nodes):
[Node A] -------------- [Node C]
Layer 1 (Medium):
[Node A] ----- [Node B] ----- [Node C]
Layer 0 (Fine-Grained, ALL Nodes):
[Node A] - [Node A1] - [Node B] - [Node B1] - [Node C] - [Node C1]
Search starts at Layer 2 (top) and navigates down to Layer 0 (bottom)
To answer a query, the search enters at the sparse top layer, greedily moves toward the closest node there, then drops down a layer and refines the result, repeating until it reaches Layer 0.
While implementing HNSW directly involves complex graph algorithms, libraries like Annoy (Approximate Nearest Neighbors Oh Yeah) and Faiss abstract away the complexity. (Annoy actually uses a different structure – a forest of random projection trees – while Faiss offers HNSW among several index types.) The "approximate" in approximate nearest neighbor search means the algorithm may NOT return the true k nearest neighbors. Instead, it returns candidates that are very likely to be near neighbors, with high probability. This happens because the algorithm uses heuristics and pruning to avoid examining every item in the dataset.
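To make the layered navigation concrete, here is a heavily simplified, hypothetical sketch of the greedy descent idea. The layer structure, node names, and 1-D `coords` are illustrative inventions for this toy example, not a real HNSW implementation:

```python
# Toy layered graph over 1-D points: each layer maps a node to its neighbors.
# layers[0] is the dense bottom layer (all nodes); layers[2] is the sparse top.
layers = [
    {"A": ["A1"], "A1": ["A", "B"], "B": ["A1", "B1"],
     "B1": ["B", "C"], "C": ["B1", "C1"], "C1": ["C"]},   # Layer 0
    {"A": ["B"], "B": ["A", "C"], "C": ["B"]},            # Layer 1
    {"A": ["C"], "C": ["A"]},                             # Layer 2
]
coords = {"A": 0.0, "A1": 1.0, "B": 3.0, "B1": 4.0, "C": 6.0, "C1": 7.0}

def greedy_search(query: float, entry: str = "A") -> str:
    """Descend from the top layer, greedily moving to whichever neighbor
    is closer to the query, then refine in the next layer down."""
    current = entry
    for layer in reversed(layers):  # start at the sparse top layer
        improved = True
        while improved:
            improved = False
            for neighbor in layer.get(current, []):
                if abs(coords[neighbor] - query) < abs(coords[current] - query):
                    current = neighbor
                    improved = True
    return current

print(greedy_search(6.6))  # 'C1' -- the node closest to 6.6
```

The search never touches most nodes: the top layer makes a long jump toward the right region, and lower layers only refine locally, which is the intuition behind HNSW's speed.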
5. A Practical Example with Annoy (Python)
Let's use Annoy to build a simple approximate nearest neighbor index.
First, install Annoy:
pip install annoy
Now, let's create a simple example:
from annoy import AnnoyIndex
import numpy as np

# Define the number of dimensions and the number of items
dimensions = 3
num_items = 1000

# Generate random data points
data = np.random.rand(num_items, dimensions)

# Build the Annoy index
index = AnnoyIndex(dimensions, 'angular')  # 'angular' distance for cosine similarity

# Add the data points to the index
for i, vector in enumerate(data):
    index.add_item(i, vector)

# Build the index (this takes some time)
index.build(10)  # Number of trees (higher = more accurate, slower build)

# Save the index to a file
index.save('my_annoy_index.ann')

# Load the index
index = AnnoyIndex(dimensions, 'angular')
index.load('my_annoy_index.ann')

# Search for the nearest neighbors of a query vector
query_vector = np.random.rand(dimensions)
n_neighbors = 5
nearest_neighbors = index.get_nns_by_vector(query_vector, n_neighbors)
print(f"Query vector: {query_vector}")
print(f"Nearest neighbors indices: {nearest_neighbors}")
# Expected Output (will vary due to random data):
# Query vector: [0.12345... ]
# Nearest neighbors indices: [12, 78, 23, 91, 45]
This code generates random data points, builds an Annoy index, and then searches for the nearest neighbors of a query vector. The build() function constructs the forest of random projection trees that makes fast lookups possible. The higher the number of trees, the more accurate the search, but the longer the build time.
6. Comparing Flat Search vs. Annoy (Time Complexity)
Let's consider the time complexity of searching.
- Flat search: O(n) per query – it must compute the distance or similarity to every item in the dataset. No index building is required, but every search examines all n items.
- Annoy: roughly O(log n) average search time per query – the search descends a forest of random projection trees instead of scanning everything. Building the index takes O(n log n) time (n items, each inserted in O(log n) time), but this one-time cost is amortized across many fast queries.
While the build time for Annoy is longer, the logarithmic search time (compared to linear for flat search) makes it a compelling choice for large datasets where many queries will be performed.
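A quick back-of-the-envelope calculation shows why the logarithmic search matters at scale (the counts below are idealized operation counts, ignoring constant factors):

```python
import math

# Idealized per-query work: flat search touches every item,
# a logarithmic-time index touches ~log2(n) of them.
for n in [1_000, 1_000_000, 1_000_000_000]:
    flat_ops = n
    log_ops = math.ceil(math.log2(n))
    print(f"n={n:>13,}: flat ~{flat_ops:>13,} ops, "
          f"logarithmic ~{log_ops} ops ({flat_ops // log_ops:,}x fewer)")
```

Going from a thousand items to a billion multiplies flat search's work a million-fold, while the logarithmic cost merely triples.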
7. Choosing the Right Approach: Trade-offs in Action
The choice between flat search and approximate search depends entirely on the application and the acceptable trade-off between speed and accuracy.
- Small Dataset, High Accuracy Required: Flat (exact) search is likely sufficient.
- Large Dataset, Speed is Paramount: Approximate search (like Annoy or Faiss) is the preferred choice.
- Moderate Dataset, Balance Required: Experiment with both approaches and measure the performance.
"Remember that approximate search may miss some true nearest neighbors – it trades perfect recall for speed. The results returned are genuine neighbors, just not guaranteed to be the absolute closest ones.
That's it for today! We're now equipped with a foundational understanding of approximate search and its benefits. Tomorrow, we're going to explore another popular library, Faiss, and delve deeper into the nuances of tuning approximate search parameters.