Sharding: Splitting the Data

In our previous post, we explored the scaling challenges faced by vector databases, highlighting the limitations of relying solely on vertical scaling. We established that when dealing with truly massive datasets and demanding query loads, horizontal scaling (distributing the data across multiple machines) becomes essential. Today, we're diving into one of the most common and powerful techniques for horizontal scaling: sharding.

Background & Context: Why Shard?

Think about a library. A single librarian can manage a small collection, but as the collection grows, a single person becomes overwhelmed. The solution? Divide the books into sections, each managed by a different librarian. That's essentially what sharding does for vector databases.

Sharding, also known as partitioning, is the process of splitting a large dataset into smaller, more manageable chunks (shards) that can be stored across multiple machines (nodes). Each node then handles a subset of the data and a portion of the query load. This distributes the burden, improves scalability, and reduces latency.

The need for sharding arises when a single machine can no longer handle the storage capacity, query throughput, or memory requirements of the data. As we discussed, scaling vertically (adding more RAM, CPU, etc. to a single machine) has its limits. Sharding allows us to bypass those limits and scale horizontally.

Core Concepts Deep Dive

1. Nodes and Shards: The Building Blocks

What it is: A shard is a subset of the overall dataset, and a node is a single machine that stores one or more shards.

Analogy: Think of it like a pizza. The entire pizza is your dataset. Each slice is a shard, and each person eating a slice is a node.

Visual Representation:

┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│   Node 1    │   │   Node 2    │   │   Node 3    │
│   Shard 1   │   │   Shard 2   │   │   Shard 3   │
└─────────────┘   └─────────────┘   └─────────────┘

Simple Example (Conceptual):

# Conceptual representation - not runnable directly
nodes = ["node1.example.com", "node2.example.com", "node3.example.com"]
shards = {
    "node1.example.com": [("vector1", 0.1), ("vector2", 0.2)],
    "node2.example.com": [("vector3", 0.3), ("vector4", 0.4)],
    "node3.example.com": [("vector5", 0.5), ("vector6", 0.6)]
}

# In a real system, this would be handled by a sharding framework.

What's happening: This simple example demonstrates how data ("vector1", "vector2", etc.) is conceptually distributed across three nodes. In a real-world scenario, a sharding framework would manage this distribution.

Real-World Example (Illustrative):

# Illustrative example - uses dictionaries for simplicity.
# A real-world system would use a database or specialized sharding framework.
node_data = {
    "node1": [("id1", [0.1, 0.2, 0.3]), ("id2", [0.4, 0.5, 0.6])],
    "node2": [("id3", [0.7, 0.8, 0.9]), ("id4", [1.0, 1.1, 1.2])],
}

# Simulate a query to node1
print(f"Data on node1: {node_data['node1']}")

#Output:
#Data on node1: [('id1', [0.1, 0.2, 0.3]), ('id2', [0.4, 0.5, 0.6])]

Pro Tip: The key is to design your data model so that shards are relatively balanced in terms of size and query load.
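One way to sanity-check that balance is to count how many items each shard receives. A minimal sketch, assuming modulo-based assignment over hypothetical integer item IDs:

```python
from collections import Counter

# Assign 9 hypothetical item IDs to 3 shards by modulo and
# count how many items land on each shard.
assignments = [item_id % 3 for item_id in range(9)]
sizes = Counter(assignments)
print(sizes)  # Counter({0: 3, 1: 3, 2: 3}) - perfectly balanced
```

With real data, skew in the sharding key (e.g., sequential IDs with gaps, or a few "hot" keys) shows up immediately in these counts.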

2. Sharding Keys: Directing Data

What it is: A sharding key is a property of each data item used to determine which shard it belongs to.

Analogy: Think of a library's Dewey Decimal System. Each book is assigned a number based on its subject, which determines its location on the shelves.

Visual Representation:

┌─────────────┐   ┌──────────────┐
│  Data Item  │──►│ Sharding Key │──► Shard Assignment
└─────────────┘   └──────────────┘

Simple Example:

def get_shard(data_id, num_shards):
    """Assigns a data item to a shard based on its ID."""
    return data_id % num_shards  # Simple modulo-based sharding

#Example
shard_id = get_shard(10, 3)  # 10 % 3 == 1
print(f"Data item with ID 10 goes to shard: {shard_id}")

#Output:
#Data item with ID 10 goes to shard: 1

What's happening: This function uses the data item's ID and the number of shards to determine the shard it belongs to. The modulo operator (%) ensures that the shard ID falls within the valid range (0 to num_shards - 1).

Real-World Example:

import hashlib

def hash_shard(key, num_nodes):
    """Hashes the key with MD5, then picks a shard by modulo."""
    return int(hashlib.md5(str(key).encode('utf-8')).hexdigest(), 16) % num_nodes

#Example
node_count = 3
shard_id = hash_shard("user_id_123", node_count)
print(f"User with ID 'user_id_123' goes to shard: {shard_id}")

#Output:
#User with ID 'user_id_123' goes to shard: 1

Pro Tip: Hash-and-modulo sharding (as shown above) is simple, but changing num_nodes reassigns most keys. Consistent hashing, which places both nodes and keys on a shared hash ring, is crucial for minimizing data movement when nodes are added or removed.
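To make that distinction concrete, here is a minimal sketch of an actual consistent-hash ring. The node names are hypothetical, and we omit virtual nodes for clarity (production rings typically add many virtual nodes per physical node for smoother balance):

```python
import bisect
import hashlib

def ring_hash(key):
    """Map a key (or node name) to a point on the ring via MD5."""
    return int(hashlib.md5(str(key).encode('utf-8')).hexdigest(), 16)

class HashRing:
    """A minimal consistent-hash ring (no virtual nodes, for clarity)."""

    def __init__(self, nodes):
        # Sort node hash points; each node owns the arc ending at its point.
        self.ring = sorted((ring_hash(n), n) for n in nodes)
        self.points = [p for p, _ in self.ring]

    def get_node(self, key):
        """Walk clockwise from the key's point to the first node."""
        idx = bisect.bisect(self.points, ring_hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node1", "node2", "node3"])
owner = ring.get_node("user_id_123")  # deterministic, one of the three nodes
```

Adding a node reassigns only the keys on the arc the new node takes over; every other key keeps its owner. That is exactly the property the modulo approach lacks.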

3. Hash Functions and Data Distribution

What it is: Hash functions are used to generate a numerical value (hash) from the sharding key, which is then used to determine the shard assignment.

Analogy: Think of a fingerprint. A fingerprint is a unique identifier for a person, and a hash function is like an algorithm that generates a unique "fingerprint" for a data item.

Simple Example (using Python's built-in hash function):

data = "my_data_item"
hash_value = hash(data)
print(f"Hash value of '{data}': {hash_value}")

#Output (example only - Python randomizes string hashes per process,
#so this value changes between runs and is unsuitable for sharding):
#Hash value of 'my_data_item': -5420786910763520222

Real-World Example (demonstrating a more complex hash function):

import hashlib

def custom_hash(data):
    """A custom hash function using SHA-256."""
    return int(hashlib.sha256(data.encode('utf-8')).hexdigest(), 16)

data = "another_data_item"
hash_value = custom_hash(data)
print(f"Custom hash value of '{data}': {hash_value}")

#Output: a 256-bit integer (roughly 77 decimal digits)

Pro Tip: When choosing a hash function, consider its distribution properties. A good hash function should distribute data items evenly across shards.
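One practical way to evaluate those distribution properties is empirical: hash a batch of synthetic keys and count the assignments per shard. A sketch using the SHA-256 approach from above (the key names are made up):

```python
import hashlib
from collections import Counter

def sha_shard(key, num_shards):
    """Stable shard assignment from a SHA-256 hash of the key."""
    return int(hashlib.sha256(key.encode('utf-8')).hexdigest(), 16) % num_shards

# Hash 10,000 synthetic keys into 4 shards and inspect the spread.
counts = Counter(sha_shard(f"item-{i}", 4) for i in range(10_000))
print(counts)  # each shard should hold roughly 2,500 keys
```

A good hash keeps the per-shard counts close to the mean; a poor one (or a skewed key) produces visibly lopsided counts.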

Practical Considerations

  • Data Locality: Try to shard data in a way that keeps related data items on the same shard to minimize cross-shard queries.
  • Rebalancing: As data grows or nodes are added or removed, you will need to rebalance the shards to maintain an even distribution.
  • Query Routing: You need a mechanism to route queries to the correct shard based on the sharding key.
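The query-routing point can be sketched with the same hash-and-modulo rule used at write time. The node names and the in-memory storage dict here are hypothetical stand-ins for real shard servers:

```python
import hashlib

NODES = ["node1", "node2", "node3"]  # hypothetical node names
storage = {node: {} for node in NODES}  # stand-in for per-node storage

def shard_for(key):
    """Apply the same hash-and-modulo rule used when the data was written."""
    digest = int(hashlib.md5(key.encode('utf-8')).hexdigest(), 16)
    return NODES[digest % len(NODES)]

def put(key, vector):
    # The router writes only to the shard responsible for this key.
    storage[shard_for(key)][key] = vector

def get(key):
    # Reads go to the same shard, so no other node is touched.
    return storage[shard_for(key)].get(key)

put("doc-42", [0.1, 0.2, 0.3])
print(get("doc-42"))  # [0.1, 0.2, 0.3]
```

Note that this only covers exact-key lookups. Nearest-neighbor queries under hash-based sharding generally have to fan out to every shard and merge the results, since similar vectors can land anywhere.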

In our next post, we'll explore different sharding strategies and dive deeper into the challenges of managing a sharded vector database.

