In our previous post, we explored the scaling challenges faced by vector databases, highlighting the limitations of relying solely on vertical scaling. We established that when dealing with truly massive datasets and demanding query loads, horizontal scaling (distributing the data across multiple machines) becomes essential. Today, we're diving into one of the most common and powerful techniques for horizontal scaling: sharding.
Background & Context: Why Shard?
Think about a library. A single librarian can manage a small collection, but as the collection grows, a single person becomes overwhelmed. The solution? Divide the books into sections, each managed by a different librarian. That's essentially what sharding does for vector databases.
Sharding, also known as partitioning, is the process of splitting a large dataset into smaller, more manageable chunks (shards) that can be stored across multiple machines (nodes). Each node then handles a subset of the data and a portion of the query load. This distributes the burden, improves scalability, and reduces latency.
The need for sharding arises when a single machine can no longer handle the storage capacity, query throughput, or memory requirements of the data. As we discussed, scaling vertically (adding more RAM, CPU, etc. to a single machine) has its limits. Sharding allows us to bypass those limits and scale horizontally.
Core Concepts Deep Dive
1. Nodes and Shards: The Building Blocks
What it is: A shard is a subset of the overall dataset, and a node is a single machine that stores one or more shards.
Analogy: Think of it like a pizza. The entire pizza is your dataset. Each slice is a shard, and each person eating a slice is a node.
Visual Representation:
+-----------+   +-----------+   +-----------+
|  Node 1   |   |  Node 2   |   |  Node 3   |
|  Shard 1  |   |  Shard 2  |   |  Shard 3  |
+-----------+   +-----------+   +-----------+
Simple Example (Conceptual):
# Conceptual representation - not runnable directly
nodes = ["node1.example.com", "node2.example.com", "node3.example.com"]
shards = {
"node1.example.com": [("vector1", 0.1), ("vector2", 0.2)],
"node2.example.com": [("vector3", 0.3), ("vector4", 0.4)],
"node3.example.com": [("vector5", 0.5), ("vector6", 0.6)]
}
# In a real system, this would be handled by a sharding framework.
What's happening: This simple example demonstrates how data ("vector1", "vector2", etc.) is conceptually distributed across three nodes. In a real-world scenario, a sharding framework would manage this distribution.
Real-World Example (Illustrative):
# Illustrative example - uses dictionaries for simplicity.
# A real-world system would use a database or specialized sharding framework.
node_data = {
"node1": [("id1", [0.1, 0.2, 0.3]), ("id2", [0.4, 0.5, 0.6])],
"node2": [("id3", [0.7, 0.8, 0.9]), ("id4", [1.0, 1.1, 1.2])],
}
# Simulate a query to node1
print(f"Data on node1: {node_data['node1']}")
#Output:
#Data on node1: [('id1', [0.1, 0.2, 0.3]), ('id2', [0.4, 0.5, 0.6])]
Pro Tip: The key is to design your data model so that shards are relatively balanced in terms of size and query load.
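Balance is easy to sanity-check in a few lines. The sketch below uses a hypothetical in-memory mapping of item IDs to shard numbers (the names are invented for illustration); in practice you would pull these counts from your database's shard metadata.

```python
from collections import Counter

def shard_sizes(assignments):
    """Count how many items landed on each shard."""
    return Counter(assignments.values())

# Hypothetical assignment: nine item IDs spread over three shards via modulo
assignments = {f"vec{i}": i % 3 for i in range(9)}
print(shard_sizes(assignments))
# Output:
# Counter({0: 3, 1: 3, 2: 3})
```

A large spread between the biggest and smallest count is an early warning that your sharding key is skewed.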
2. Sharding Keys: Directing Data
What it is: A sharding key is a property of each data item used to determine which shard it belongs to.
Analogy: Think of a library's Dewey Decimal System. Each book is assigned a number based on its subject, which determines its location on the shelves.
Visual Representation:
+-----------+     +--------------+
| Data Item | --> | Sharding Key | --> Shard Assignment
+-----------+     +--------------+
Simple Example:
def get_shard(data_id, num_shards):
"""Assigns a data item to a shard based on its ID."""
return data_id % num_shards # Simple modulo-based sharding
#Example
shard_id = get_shard(10, 3)  # Assigns data item with ID 10 to shard 1
print(f"Data item with ID 10 goes to shard: {shard_id}")
#Output:
#Data item with ID 10 goes to shard: 1
What's happening: This function uses the data item's ID and the number of shards to determine the shard it belongs to. The modulo operator (%) ensures that the shard ID falls within the valid range (0 to num_shards - 1).
Real-World Example:
import hashlib
def hash_shard(key, num_nodes):
    """Hashes a key with MD5, then assigns it to a node via modulo."""
    return int(hashlib.md5(str(key).encode('utf-8')).hexdigest(), 16) % num_nodes
#Example
node_count = 3
shard_id = hash_shard("user_id_123", node_count)
print(f"User with ID 'user_id_123' goes to shard: {shard_id}")
#Output:
#User with ID 'user_id_123' goes to shard: 1
Pro Tip: Hashing the key before taking the modulo handles non-numeric keys, but note that changing num_nodes still remaps most keys. True consistent hashing, which places both nodes and keys on a shared hash ring, is what minimizes data movement when nodes are added or removed.
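Consistent hashing is worth seeing in miniature. The sketch below is an illustrative hash ring with no virtual nodes (production systems add those for better balance); the `HashRing` class and node names are invented for this example. Each node is placed at its MD5 hash on a ring, and a key is served by the first node clockwise from the key's own hash.

```python
import hashlib
from bisect import bisect_right

def ring_hash(key):
    """Stable position on the ring for a node name or data key."""
    return int(hashlib.md5(str(key).encode("utf-8")).hexdigest(), 16)

class HashRing:
    """Toy consistent-hash ring: no virtual nodes, no replication."""
    def __init__(self, nodes):
        self.ring = sorted((ring_hash(n), n) for n in nodes)
        self.positions = [h for h, _ in self.ring]

    def get_node(self, key):
        # First node at or past the key's hash, wrapping around the ring.
        idx = bisect_right(self.positions, ring_hash(key)) % len(self.ring)
        return self.ring[idx][1]

old_ring = HashRing(["node1", "node2", "node3"])
new_ring = HashRing(["node1", "node2", "node3", "node4"])
keys = [f"user_{i}" for i in range(100)]
moved = [k for k in keys if old_ring.get_node(k) != new_ring.get_node(k)]
print(f"{len(moved)} of {len(keys)} keys moved after adding node4")
```

The key property: every key that moved now lives on the new node, and all other keys stay where they were. With plain hash-mod sharding, adding a node would remap roughly (n-1)/n of all keys instead.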
3. Hash Functions and Data Distribution
What it is: Hash functions are used to generate a numerical value (hash) from the sharding key, which is then used to determine the shard assignment.
Analogy: Think of a fingerprint. A fingerprint is a unique identifier for a person, and a hash function is like an algorithm that generates a unique "fingerprint" for a data item.
Simple Example (using Python's built-in hash function):
data = "my_data_item"
hash_value = hash(data)
print(f"Hash value of '{data}': {hash_value}")
#Output (varies between runs):
#Hash value of 'my_data_item': -5420786910763520222
Note: Python salts string hashes per process (see PYTHONHASHSEED), so hash() returns different values across runs and machines. For sharding, use a stable hash such as MD5 or SHA-256 instead.
Real-World Example (demonstrating a more complex hash function):
import hashlib
def custom_hash(data):
"""A custom hash function using SHA-256."""
return int(hashlib.sha256(data.encode('utf-8')).hexdigest(), 16)
data = "another_data_item"
hash_value = custom_hash(data)
print(f"Custom hash value of '{data}': {hash_value}")
#Output: a very large 256-bit integer (roughly 77 decimal digits)
Pro Tip: When choosing a hash function, consider its distribution properties. A good hash function should distribute data items evenly across shards.
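One quick way to evaluate those distribution properties is to hash many synthetic keys and count how they land. This sketch uses SHA-256 with modulo assignment; the key names are made up for the experiment.

```python
import hashlib
from collections import Counter

def sha_shard(key, num_shards):
    """Stable shard assignment via SHA-256 (unlike hash(), not salted per process)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Hash 10,000 synthetic keys into 4 shards and count the spread
counts = Counter(sha_shard(f"item_{i}", 4) for i in range(10_000))
print(counts)  # each shard should hold roughly 2,500 keys
```

If one shard's count is far from the others, the hash function (or the key you feed it) is a poor fit for your data.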
Practical Considerations
- Data Locality: Try to shard data in a way that keeps related data items on the same shard to minimize cross-shard queries.
- Rebalancing: As data grows or nodes are added/removed, you will need to rebalance the shards to maintain even distribution.
- Query Routing: You need a mechanism to route queries to the correct shard based on the sharding key.
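A query router can be as small as a lookup that reuses the same hash as the write path, so reads land on the node that holds the data. The node addresses below are hypothetical, and the network call is stubbed out with a string.

```python
import hashlib

NODES = ["node1.example.com", "node2.example.com", "node3.example.com"]

def route(key):
    """Return the node responsible for a key (hash-mod routing, matching writes)."""
    h = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return NODES[h % len(NODES)]

def query(key):
    node = route(key)
    # In a real system this would be an RPC or HTTP request to `node`.
    return f"querying {node} for key {key!r}"

print(query("user_id_123"))
```

The essential invariant: `route()` must be deterministic and identical on the write and read paths, or queries will miss the shard that actually stores the vector.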
In our next post, we'll explore different sharding strategies and dive deeper into the challenges of managing a sharded vector database.