Sharding: Splitting the Data

In our previous post, we explored the scaling challenges faced by vector databases, highlighting the limitations of relying solely on vertical scaling. We established that when dealing with truly massive datasets and demanding query loads, horizontal scaling (distributing the data across multiple machines) becomes essential. Today, we're diving into one of the most common and powerful techniques for horizontal scaling: sharding.

Background & Context: Why Shard?

Think about a library. A single librarian can manage a small collection, but as the collection grows, a single person becomes overwhelmed. The solution? Divide the books into sections, each managed by a different librarian. That's essentially what sharding does for vector databases.

Sharding, also known as partitioning, is the process of splitting a large dataset into smaller, more manageable chunks (shards) that can be stored across multiple machines (nodes). Each node then handles a subset of the data and a portion of the query load. This distributes the burden, improves scalability, and reduces latency.

The need for sharding arises when a single machine can no longer handle the storage capacity, query throughput, or memory requirements of the data. As we discussed, scaling vertically (adding more RAM, CPU, etc. to a single machine) has its limits. Sharding allows us to bypass those limits and scale horizontally.

Core Concepts Deep Dive

1. Nodes and Shards: The Building Blocks

What it is: A shard is a subset of the overall dataset, and a node is a single machine that stores one or more shards.

Analogy: Think of it like a pizza. The entire pizza is your dataset. Each slice is a shard, and each person eating a slice is a node.

Visual Representation:

┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│   Node 1    │   │   Node 2    │   │   Node 3    │
│   Shard 1   │   │   Shard 2   │   │   Shard 3   │
└─────────────┘   └─────────────┘   └─────────────┘

Simple Example (Conceptual):

# Conceptual representation - not runnable directly
nodes = ["node1.example.com", "node2.example.com", "node3.example.com"]
shards = {
    "node1.example.com": [("vector1", 0.1), ("vector2", 0.2)],
    "node2.example.com": [("vector3", 0.3), ("vector4", 0.4)],
    "node3.example.com": [("vector5", 0.5), ("vector6", 0.6)]
}

# In a real system, this would be handled by a sharding framework.

What's happening: This simple example demonstrates how data ("vector1", "vector2", etc.) is conceptually distributed across three nodes. In a real-world scenario, a sharding framework would manage this distribution.

Real-World Example (Illustrative):

# Illustrative example - uses dictionaries for simplicity.
# A real-world system would use a database or specialized sharding framework.
node_data = {
    "node1": [("id1", [0.1, 0.2, 0.3]), ("id2", [0.4, 0.5, 0.6])],
    "node2": [("id3", [0.7, 0.8, 0.9]), ("id4", [1.0, 1.1, 1.2])],
}

# Simulate a query to node1
print(f"Data on node1: {node_data['node1']}")

#Output:
#Data on node1: [('id1', [0.1, 0.2, 0.3]), ('id2', [0.4, 0.5, 0.6])]

Pro Tip: The key is to design your data model so that shards are relatively balanced in terms of size and query load.
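One way to sanity-check that balance is to count how many items each shard receives. A minimal sketch, assuming modulo-based assignment over hypothetical integer item IDs:

```python
from collections import Counter

# Assign 9 hypothetical item IDs to 3 shards by modulo and
# count how many items land on each shard.
assignments = [item_id % 3 for item_id in range(9)]
sizes = Counter(assignments)
print(sizes)  # Counter({0: 3, 1: 3, 2: 3}) - perfectly balanced
```

With real data, skew in the sharding key (e.g., sequential IDs with gaps, or a few "hot" keys) shows up immediately in these counts.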

2. Sharding Keys: Directing Data

What it is: A sharding key is a property of each data item used to determine which shard it belongs to.

Analogy: Think of a library's Dewey Decimal System. Each book is assigned a number based on its subject, which determines its location on the shelves.

Visual Representation:

┌─────────────┐   ┌──────────────┐
│  Data Item  │──►│ Sharding Key │──► Shard Assignment
└─────────────┘   └──────────────┘

Simple Example:

def get_shard(data_id, num_shards):
    """Assigns a data item to a shard based on its ID."""
    return data_id % num_shards  # Simple modulo-based sharding

#Example
shard_id = get_shard(10, 3)  # 10 % 3 == 1
print(f"Data item with ID 10 goes to shard: {shard_id}")

#Output:
#Data item with ID 10 goes to shard: 1

What's happening: This function uses the data item's ID and the number of shards to determine the shard it belongs to. The modulo operator (%) ensures that the shard ID falls within the valid range (0 to num_shards - 1).

Real-World Example:

import hashlib

def hash_shard(key, num_nodes):
    """Hashes the key with MD5, then picks a shard by modulo."""
    return int(hashlib.md5(str(key).encode('utf-8')).hexdigest(), 16) % num_nodes

#Example
node_count = 3
shard_id = hash_shard("user_id_123", node_count)
print(f"User with ID 'user_id_123' goes to shard: {shard_id}")

#Output:
#User with ID 'user_id_123' goes to shard: 1

Pro Tip: Hash-and-modulo sharding (as shown above) is simple, but changing num_nodes reassigns most keys. Consistent hashing, which places both nodes and keys on a shared hash ring, is crucial for minimizing data movement when nodes are added or removed.
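To make that distinction concrete, here is a minimal sketch of an actual consistent-hash ring. The node names are hypothetical, and we omit virtual nodes for clarity (production rings typically add many virtual nodes per physical node for smoother balance):

```python
import bisect
import hashlib

def ring_hash(key):
    """Map a key (or node name) to a point on the ring via MD5."""
    return int(hashlib.md5(str(key).encode('utf-8')).hexdigest(), 16)

class HashRing:
    """A minimal consistent-hash ring (no virtual nodes, for clarity)."""

    def __init__(self, nodes):
        # Sort node hash points; each node owns the arc ending at its point.
        self.ring = sorted((ring_hash(n), n) for n in nodes)
        self.points = [p for p, _ in self.ring]

    def get_node(self, key):
        """Walk clockwise from the key's point to the first node."""
        idx = bisect.bisect(self.points, ring_hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node1", "node2", "node3"])
owner = ring.get_node("user_id_123")  # deterministic, one of the three nodes
```

Adding a node reassigns only the keys on the arc the new node takes over; every other key keeps its owner. That is exactly the property the modulo approach lacks.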

3. Hash Functions and Data Distribution

What it is: Hash functions are used to generate a numerical value (hash) from the sharding key, which is then used to determine the shard assignment.

Analogy: Think of a fingerprint. A fingerprint is a unique identifier for a person, and a hash function is like an algorithm that generates a unique "fingerprint" for a data item.

Simple Example (using Python's built-in hash function):

data = "my_data_item"
hash_value = hash(data)
print(f"Hash value of '{data}': {hash_value}")

#Output (example only - Python randomizes string hashes per process,
#so this value changes between runs and is unsuitable for sharding):
#Hash value of 'my_data_item': -5420786910763520222

Real-World Example (demonstrating a more complex hash function):

import hashlib

def custom_hash(data):
    """A custom hash function using SHA-256."""
    return int(hashlib.sha256(data.encode('utf-8')).hexdigest(), 16)

data = "another_data_item"
hash_value = custom_hash(data)
print(f"Custom hash value of '{data}': {hash_value}")

#Output: a 256-bit integer (roughly 77 decimal digits)

Pro Tip: When choosing a hash function, consider its distribution properties. A good hash function should distribute data items evenly across shards.
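One practical way to evaluate those distribution properties is empirical: hash a batch of synthetic keys and count the assignments per shard. A sketch using the SHA-256 approach from above (the key names are made up):

```python
import hashlib
from collections import Counter

def sha_shard(key, num_shards):
    """Stable shard assignment from a SHA-256 hash of the key."""
    return int(hashlib.sha256(key.encode('utf-8')).hexdigest(), 16) % num_shards

# Hash 10,000 synthetic keys into 4 shards and inspect the spread.
counts = Counter(sha_shard(f"item-{i}", 4) for i in range(10_000))
print(counts)  # each shard should hold roughly 2,500 keys
```

A good hash keeps the per-shard counts close to the mean; a poor one (or a skewed key) produces visibly lopsided counts.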

Practical Considerations

  • Data Locality: Try to shard data in a way that keeps related data items on the same shard to minimize cross-shard queries.
  • Rebalancing: As data grows or nodes are added or removed, you will need to rebalance the shards to maintain an even distribution.
  • Query Routing: You need a mechanism to route queries to the correct shard based on the sharding key.
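The query-routing point can be sketched with the same hash-and-modulo rule used at write time. The node names and the in-memory storage dict here are hypothetical stand-ins for real shard servers:

```python
import hashlib

NODES = ["node1", "node2", "node3"]  # hypothetical node names
storage = {node: {} for node in NODES}  # stand-in for per-node storage

def shard_for(key):
    """Apply the same hash-and-modulo rule used when the data was written."""
    digest = int(hashlib.md5(key.encode('utf-8')).hexdigest(), 16)
    return NODES[digest % len(NODES)]

def put(key, vector):
    # The router writes only to the shard responsible for this key.
    storage[shard_for(key)][key] = vector

def get(key):
    # Reads go to the same shard, so no other node is touched.
    return storage[shard_for(key)].get(key)

put("doc-42", [0.1, 0.2, 0.3])
print(get("doc-42"))  # [0.1, 0.2, 0.3]
```

Note that this only covers exact-key lookups. Nearest-neighbor queries under hash-based sharding generally have to fan out to every shard and merge the results, since similar vectors can land anywhere.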

In our next post, we'll explore different sharding strategies and dive deeper into the challenges of managing a sharded vector database.

