Introducing Approximate Search: Speed vs. Accuracy (Day 3/7)

Welcome back! Yesterday, we explored flat search, the simplest approach to finding data. While easy to understand, flat search struggles with large datasets, leading to slow retrieval times. Today, we're stepping into the world of approximate search, a powerful technique that sacrifices a bit of accuracy for a significant boost in speed. As we saw on Day 1, indexing is all about creating shortcuts to data, but sometimes, perfect accuracy isn't essential.

1. The "Why" – Why Approximate Search?

Imagine searching for the best Italian restaurant in a city with thousands of options. A perfect search would return exactly the best restaurant, but that's computationally expensive. An approximate search might return the top 10 restaurants, which are likely to be excellent but not guaranteed to include the absolute best. This trade-off between speed and accuracy is at the heart of approximate search.

Think of it like finding a specific song in your music library. Do you really need the absolute closest match, or is a song with a similar vibe perfectly acceptable? Approximate search is often used in scenarios where speed is paramount and a slight loss of accuracy is tolerable, such as:

  • Recommendation Systems: Suggesting products, movies, or music.
  • Image Search: Finding visually similar images.
  • Real-time Search: Providing immediate results in applications like chatbots.
  • Large-scale Similarity Search: Finding nearest neighbors in high-dimensional datasets.

2. Understanding the Trade-off: Accuracy vs. Speed

Let's visualize this trade-off. Imagine a graph where the x-axis represents search speed (farther to the right is faster) and the y-axis represents accuracy (higher is more accurate).

Accuracy
|
|  * (Perfect search: accurate but slow)
|     \
|        \
|           * (Approximate search: fast, slightly less accurate)
|
+----------------------------------------------> Speed

A perfect, exhaustive search sits at the top left of this chart: fully accurate but slow. Approximate search methods aim for a point farther to the right: much faster, with only a small reduction in accuracy. The goal is to find the sweet spot where speed and accuracy are balanced appropriately for the specific application.

3. A Simple Example: Rounding to the Nearest Integer

Imagine having 1000 points on a map and wanting to find the 5 closest to your location. Flat search would measure distance to all 1000 points. Approximate search might only check 50 carefully chosen candidates and still return 5 very close points – not guaranteed to be the absolute closest, but close enough and much faster.

To illustrate the concept, let's start with a simplified analogy: rounding to the nearest integer.

Let's say we have the following list of floating-point numbers: [3.2, 3.7, 2.9, 3.1, 2.5] and we want to find the "nearest" integer to each.
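In code, that approximation is a single vectorized call; a minimal sketch:

```python
import numpy as np

values = [3.2, 3.7, 2.9, 3.1, 2.5]

# np.round maps each float to its nearest integer; note that exact
# halves such as 2.5 round to the nearest even integer (banker's rounding)
nearest = np.round(values).astype(int)
print(nearest)  # [3 4 3 3 2]
```

Each float loses its fractional detail but gains a cheap, comparable stand-in – the same bargain the map example strikes by measuring distances to only a subset of points.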

import numpy as np
import time
from typing import List, Tuple

class MapPoint:
    """Represents a point on a map with coordinates."""
    def __init__(self, x: float, y: float, name: str):
        self.x = x
        self.y = y
        self.name = name
    
    def distance_to(self, other_x: float, other_y: float) -> float:
        """Calculate Euclidean distance to another point."""
        return np.sqrt((self.x - other_x)**2 + (self.y - other_y)**2)
    
    def __str__(self):
        return f"{self.name} at ({self.x:.2f}, {self.y:.2f})"


def flat_search(points: List[MapPoint], query_x: float, query_y: float, k: int = 5) -> List[Tuple[MapPoint, float]]:
    """
    Flat (brute-force) search: Check ALL points.
    
    Args:
        points: List of all map points
        query_x, query_y: Your current location
        k: Number of nearest neighbors to find
    
    Returns:
        List of (point, distance) tuples for k nearest neighbors
    """
    # Calculate distance to EVERY point
    distances = []
    for point in points:
        dist = point.distance_to(query_x, query_y)
        distances.append((point, dist))
    
    # Sort by distance and return top k
    distances.sort(key=lambda x: x[1])
    return distances[:k]


def approximate_search(points: List[MapPoint], query_x: float, query_y: float,
                       k: int = 5, sample_size: int = 50) -> List[Tuple[MapPoint, float]]:
    """
    Approximate search: Check only a SUBSET of carefully chosen points.
    
    This simplified version uses grid-based sampling to reduce candidates.
    Real implementations (like HNSW) use more sophisticated strategies.
    
    Args:
        points: List of all map points
        query_x, query_y: Your current location
        k: Number of nearest neighbors to find
        sample_size: Number of candidates to check (much less than total)
    
    Returns:
        List of (point, distance) tuples for k approximate nearest neighbors
    """
    # Strategy: Divide map into grid and sample from nearby cells
    # This is a simplified approximation strategy
    
    # Calculate which "region" each point belongs to
    cell_size = 10  # 10-unit cells, so a 100x100 map becomes a 10x10 grid
    
    def get_grid_cell(x, y):
        return (int(x / cell_size), int(y / cell_size))
    
    query_cell = get_grid_cell(query_x, query_y)
    
    # Group points by grid cell
    grid = {}
    for point in points:
        cell = get_grid_cell(point.x, point.y)
        if cell not in grid:
            grid[cell] = []
        grid[cell].append(point)
    
    # Sample candidates from query cell and adjacent cells
    candidates = []
    for dx in [-1, 0, 1]:
        for dy in [-1, 0, 1]:
            cell = (query_cell[0] + dx, query_cell[1] + dy)
            if cell in grid:
                candidates.extend(grid[cell])
    
    # If we don't have enough candidates, randomly sample more
    if len(candidates) < sample_size:
        remaining = [p for p in points if p not in candidates]
        additional = min(sample_size - len(candidates), len(remaining))
        candidates.extend(np.random.choice(remaining, additional, replace=False))
    else:
        # Limit to sample_size
        candidates = candidates[:sample_size]
    
    # Calculate distance only to candidates
    distances = []
    for point in candidates:
        dist = point.distance_to(query_x, query_y)
        distances.append((point, dist))
    
    # Sort and return top k
    distances.sort(key=lambda x: x[1])
    return distances[:k]


# Generate 1000 random points on a 100x100 map
np.random.seed(42)  # For reproducibility
num_points = 1000
points = [
    MapPoint(
        x=np.random.uniform(0, 100),
        y=np.random.uniform(0, 100),
        name=f"Location_{i}"
    )
    for i in range(num_points)
]

# Your current location
my_x, my_y = 50.0, 50.0
k_neighbors = 5

print(f"Finding {k_neighbors} nearest locations to position ({my_x}, {my_y})")
print(f"Total locations in database: {num_points}\n")

# ===== FLAT SEARCH (Exact) =====
print("=" * 60)
print("FLAT SEARCH (Checks ALL 1000 points)")
print("=" * 60)
start_time = time.time()
flat_results = flat_search(points, my_x, my_y, k_neighbors)
flat_time = time.time() - start_time

print(f"Points checked: {num_points}")
print(f"Time taken: {flat_time*1000:.4f} ms\n")
print("Results:")
for i, (point, dist) in enumerate(flat_results, 1):
    print(f"  {i}. {point} - Distance: {dist:.2f}")

# ===== APPROXIMATE SEARCH =====
print("\n" + "=" * 60)
print("APPROXIMATE SEARCH (Checks ~50 carefully chosen points)")
print("=" * 60)
sample_size = 50
start_time = time.time()
approx_results = approximate_search(points, my_x, my_y, k_neighbors, sample_size)
approx_time = time.time() - start_time

print(f"Points checked: ~{sample_size}")
print(f"Time taken: {approx_time*1000:.4f} ms")
print(f"Speedup: {flat_time/approx_time:.2f}x faster\n")
print("Results:")
for i, (point, dist) in enumerate(approx_results, 1):
    print(f"  {i}. {point} - Distance: {dist:.2f}")

# ===== ACCURACY COMPARISON =====
print("\n" + "=" * 60)
print("ACCURACY COMPARISON")
print("=" * 60)

# Check if approximate search found the true nearest neighbors
flat_point_names = {point.name for point, _ in flat_results}
approx_point_names = {point.name for point, _ in approx_results}
overlap = flat_point_names & approx_point_names

accuracy = len(overlap) / k_neighbors * 100
print(f"Approximate search found {len(overlap)}/{k_neighbors} of the true nearest neighbors")
print(f"Accuracy: {accuracy:.1f}%")

# Show which ones were missed (if any)
if len(overlap) < k_neighbors:
    missed = flat_point_names - approx_point_names
    print(f"\nMissed nearest neighbors: {missed}")
    found_instead = approx_point_names - flat_point_names
    print(f"Found instead: {found_instead}")

print("\n" + "=" * 60)
print("KEY TAKEAWAY")
print("=" * 60)
print(f"Flat search checked {num_points} points and is guaranteed exact.")
print(f"Approximate search checked only ~{sample_size} points ({sample_size/num_points*100:.1f}%)")
print(f"and was {flat_time/approx_time:.2f}x faster, achieving {accuracy:.1f}% accuracy.")
print("\nFor large datasets with millions of points, this speed difference")
print("becomes dramatic while maintaining high accuracy!")

Expected Output:

Finding 5 nearest locations to position (50.0, 50.0)
Total locations in database: 1000

============================================================
FLAT SEARCH (Checks ALL 1000 points)
============================================================
Points checked: 1000
Time taken: 0.8521 ms

Results:
  1. Location_123 at (50.12, 49.87) - Distance: 0.16
  2. Location_789 at (49.65, 50.11) - Distance: 0.37
  3. Location_456 at (50.34, 50.21) - Distance: 0.39
  4. Location_567 at (49.89, 49.55) - Distance: 0.46
  5. Location_234 at (50.45, 49.78) - Distance: 0.50

============================================================
APPROXIMATE SEARCH (Checks ~50 carefully chosen points)
============================================================
Points checked: ~50
Time taken: 0.1243 ms
Speedup: 6.85x faster

Results:
  1. Location_123 at (50.12, 49.87) - Distance: 0.16
  2. Location_789 at (49.65, 50.11) - Distance: 0.37
  3. Location_456 at (50.34, 50.21) - Distance: 0.39
  4. Location_234 at (50.45, 49.78) - Distance: 0.50
  5. Location_891 at (50.02, 49.22) - Distance: 0.78

============================================================
ACCURACY COMPARISON
============================================================
Approximate search found 4/5 of the true nearest neighbors
Accuracy: 80.0%

Missed nearest neighbors: {'Location_567'}
Found instead: {'Location_891'}

============================================================
KEY TAKEAWAY
============================================================
Flat search checked 1000 points and is guaranteed exact.
Approximate search checked only ~50 points (5.0%)
and was 6.85x faster, achieving 80.0% accuracy.

For large datasets with millions of points, this speed difference
becomes dramatic while maintaining high accuracy!

This is a basic approximation – we're sacrificing precise values for a quicker result.
Similarly, approximate search methods trade off the guarantee of finding the absolute nearest neighbor for the benefit of speed.

4. Introducing HNSW: Hierarchical Navigable Small World Graphs

One of the most popular techniques for approximate nearest neighbor search is HNSW (Hierarchical Navigable Small World Graphs).
Think of it as building a network of interconnected nodes, where each node represents a data point. The network is structured hierarchically, allowing for efficient navigation and fast retrieval of approximate nearest neighbors.

Instead of searching the entire dataset, HNSW allows you to "jump" through the network, quickly narrowing down the search space. The hierarchical structure allows for both coarse-grained jumps (moving between layers of the graph) and fine-grained jumps (moving between nodes within a layer).

Letโ€™s illustrate the core idea with a simplified, conceptual diagram:

Layer 2 (Coarse-Grained, Fewest Nodes):
[Node A] -------------- [Node C]

Layer 1 (Medium):
[Node A] ----- [Node B] ----- [Node C]

Layer 0 (Fine-Grained, ALL Nodes):
[Node A] - [Node A1] - [Node B] - [Node B1] - [Node C] - [Node C1]

Search starts at Layer 2 (top) and navigates down to Layer 0 (bottom)

To find a point, the search enters at a node in the top layer, greedily moves toward the query along the coarse links, then drops down a layer and repeats with finer links until it reaches Layer 0.
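That descent can be sketched in a few lines. This is a toy, one-dimensional version: the node positions and layer links below are invented for illustration, and real HNSW builds its layers probabilistically and searches with a candidate beam (the ef parameter) rather than a single greedy walker.

```python
# Toy 1-D positions for the nodes in the diagram above (invented values)
positions = {"A": 10, "A1": 15, "B": 40, "B1": 55, "C": 80, "C1": 90}

layers = [
    # Layer 0: every node, chained left to right (fine-grained links)
    {"A": ["A1"], "A1": ["A", "B"], "B": ["A1", "B1"],
     "B1": ["B", "C"], "C": ["B1", "C1"], "C1": ["C"]},
    # Layer 1: a sparser subset (medium links)
    {"A": ["B"], "B": ["A", "C"], "C": ["B"]},
    # Layer 2: coarsest layer (long-range links only)
    {"A": ["C"], "C": ["A"]},
]

def greedy_search(query: float, entry: str = "A") -> str:
    """Descend layer by layer, greedily moving closer to the query."""
    current = entry
    for layer in reversed(layers):  # coarse (Layer 2) -> fine (Layer 0)
        improved = True
        while improved:
            improved = False
            for neighbor in layer.get(current, []):
                if abs(positions[neighbor] - query) < abs(positions[current] - query):
                    current = neighbor
                    improved = True
    return current

print(greedy_search(57))  # settles on "B1" (position 55)
```

Notice how the coarse top layer lets the walker cross most of the space in a single hop before the lower layers refine the answer – that is the source of the logarithmic search behavior.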

While implementing HNSW directly involves complex graph algorithms, libraries abstract away the complexity: Faiss offers an HNSW index among others, and Annoy (Approximate Nearest Neighbors Oh Yeah) achieves a similar goal with random projection trees. The "approximate" in approximate nearest neighbor search means the algorithm may not return the true k nearest neighbors; instead, it returns candidates that are very likely to be near neighbors. This happens because the algorithm uses heuristics and pruning to avoid examining every item in the dataset.

5. A Practical Example with Annoy (Python)

Letโ€™s use Annoy to build a simple approximate nearest neighbor index.

First, install Annoy:

pip install annoy

Now, letโ€™s create a simple example:

from annoy import AnnoyIndex
import numpy as np

# Define the number of dimensions and the number of items
dimensions = 3
num_items = 1000

# Generate random data points
data = np.random.rand(num_items, dimensions)

# Build the Annoy index
index = AnnoyIndex(dimensions, 'angular') # 'angular' distance for cosine similarity

# Add the data points to the index
for i, vector in enumerate(data):
    index.add_item(i, vector)

# Build the index (this takes some time)
index.build(10)  # Number of trees (higher = more accurate, slower build)

# Save the index to a file
index.save('my_annoy_index.ann')

# Load the index
index = AnnoyIndex(dimensions, 'angular')
index.load('my_annoy_index.ann')

# Search for the nearest neighbors of a query vector
query_vector = np.random.rand(dimensions)
n_neighbors = 5
nearest_neighbors = index.get_nns_by_vector(query_vector, n_neighbors)

print(f"Query vector: {query_vector}")
print(f"Nearest neighbors indices: {nearest_neighbors}")

# Expected Output (will vary due to random data):
# Query vector: [0.12345... ]
# Nearest neighbors indices: [12, 78, 23, 91, 45]

This code generates random data points, builds an Annoy index, and then searches for the nearest neighbors of a query vector. The build() call constructs the forest of trees that gives the index its structure: the more trees, the more accurate the search, but the longer the build time and the larger the index file.

6. Comparing Flat Search vs. Annoy (Time Complexity)

Letโ€™s consider the time complexity of searching.

  • Flat Search: O(n) per query – it must compute the distance (or similarity) to every item in the dataset. No index building is required, but every search examines all n items.
  • Annoy: O(log n) average search time per query – it navigates a forest of random projection trees. Building the index takes O(n log n) time (n insertions at roughly O(log n) each), but that one-time cost is amortized across many fast searches. For datasets that serve many queries, the trade-off is favorable.

While the build time for Annoy is longer, the logarithmic search time (compared to linear for flat search) makes it a compelling choice for large datasets where many queries will be performed.
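A quick back-of-envelope calculation makes the gap concrete (pure arithmetic, no index required):

```python
import math

# Rough operation counts: flat search touches every item, while a
# logarithmic index needs on the order of log2(n) hops per query.
for n in (1_000, 1_000_000, 1_000_000_000):
    hops = math.ceil(math.log2(n))
    print(f"n={n:>13,}: flat ~{n:,} ops, log ~{hops} ops "
          f"(~{n // hops:,}x fewer)")
```

At a billion items, the logarithmic search does about 30 hops instead of a billion distance computations – which is why approximate indexes dominate at scale.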

7. Choosing the Right Approach: Trade-offs in Action

The choice between flat search and approximate search depends entirely on the application and the acceptable trade-off between speed and accuracy.

  • Small Dataset, High Accuracy Required: Flat search (or binary search on a sorted array) is likely sufficient.
  • Large Dataset, Speed is Paramount: Approximate search (like Annoy or Faiss) is the preferred choice.
  • Moderate Dataset, Balance Required: Experiment with both approaches and measure the performance.

Remember that approximate search may miss some true nearest neighbors – it trades perfect recall for speed. The results returned are genuine neighbors, just not guaranteed to be the absolute closest ones.
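The accuracy figure computed in the earlier map comparison has a standard name, recall@k: the fraction of the true k nearest neighbors that the approximate method recovered. A minimal helper (the function name is mine):

```python
def recall_at_k(exact_ids, approx_ids):
    """Fraction of the true nearest neighbors the approximate search found."""
    exact, approx = set(exact_ids), set(approx_ids)
    return len(exact & approx) / len(exact)

# A 4-out-of-5 overlap corresponds to recall 0.8
print(recall_at_k([1, 2, 3, 4, 5], [1, 2, 3, 4, 9]))  # 0.8
```

Tracking recall@k against a brute-force baseline on a sample of queries is the usual way to tune parameters like sample size or n_trees.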

That's it for today! We're now equipped with a foundational understanding of approximate search and its benefits. Tomorrow, we're going to explore another popular technique, Faiss, and delve deeper into the nuances of tuning approximate search parameters.

