In our previous post, we explored how vectors are organized within a vector database. We saw the limitations of a flat index when dealing with large datasets. Today, we're tackling a crucial step that often precedes even vectorization: document chunking. Large documents, such as entire books, long research papers, or extensive legal contracts, can't be directly converted into vectors. They need to be broken down into smaller, more manageable pieces. This process, known as document chunking, is essential for efficient indexing and retrieval.
Background & Context
Imagine trying to fit a whole novel into a single vector. The resulting vector would be massive, unwieldy, and likely lose the nuances of the story. Document chunking is about finding the right balance: creating chunks that are large enough to retain context but small enough to be effectively vectorized and indexed. It's a critical step in the overall vector search pipeline. Without proper chunking, you risk losing vital information or creating vectors that are too noisy to be useful.
Core Concepts Deep Dive
Let's explore the key concepts involved in document chunking.
- Chunk Size: This refers to the maximum number of tokens (words or sub-words) in a chunk. Common sizes range from 256 to 1024 tokens, but the optimal size depends on the document type and the embedding model used.
- Overlapping Chunks: To preserve context between chunks, we often use overlapping. This means that consecutive chunks share a portion of the original text.
- Context: The surrounding information that helps understand the meaning of a piece of text. Proper chunking aims to maintain as much context as possible within each chunk.
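To make "chunk size" concrete, here is a minimal sketch that approximates token counts by splitting on whitespace. This is only an approximation of my own for illustration; real embedding models use subword tokenizers (such as BPE), so actual token counts will differ.

```python
def approximate_token_count(text):
    """Approximate the number of tokens by splitting on whitespace.

    Real tokenizers split words into subwords, so this undercounts
    slightly, but it is good enough to reason about chunk budgets.
    """
    return len(text.split())

paragraph = "Document chunking splits long texts into smaller pieces."
print(approximate_token_count(paragraph))  # 8 tokens under this approximation
```

In practice, you would measure chunks with the same tokenizer your embedding model uses, then pick a chunk size well inside its context limit.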
Chunking Strategies: Finding the Right Approach
Several strategies can be employed for document chunking. Let's look at some common techniques.
1. Fixed-Size Chunking: The Simplest Approach
This is the most straightforward method, where documents are split into chunks of a predetermined size.
def fixed_size_chunking(text, chunk_size):
    """Splits text into fixed-size chunks.

    Args:
        text: The text to chunk.
        chunk_size: The maximum number of tokens per chunk.

    Returns:
        A list of strings, where each string is a chunk.
    """
    tokens = text.split()  # Simple whitespace tokenization - can be improved
    chunks = []
    for i in range(0, len(tokens), chunk_size):
        chunks.append(" ".join(tokens[i:i + chunk_size]))
    return chunks
# Example
document = ("This is a sample document. It contains several sentences. "
            "We want to chunk it into smaller pieces. Chunking is important for vectorization.")
chunk_size = 5
chunks = fixed_size_chunking(document, chunk_size)
print(chunks)
# Expected Output: ['This is a sample document.',
# 'It contains several sentences. We',
# 'want to chunk it into',
# 'smaller pieces. Chunking is important',
# 'for vectorization.']
While simple, this approach can disrupt sentences and lose context. Imagine a crucial piece of information being cut off mid-sentence.
2. Sentence-Based Chunking: Preserving Sentence Boundaries
A better approach is to split documents based on sentence boundaries. This ensures that each chunk contains complete sentences, preserving context.
import nltk

def sentence_based_chunking(text, chunk_size):
    """Splits text into chunks based on sentences.

    Args:
        text: The text to chunk.
        chunk_size: The maximum number of sentences per chunk.

    Returns:
        A list of strings, where each string is a chunk.
    """
    sentences = nltk.sent_tokenize(text)
    chunks = []
    current_chunk = ""
    sentence_count = 0
    for sentence in sentences:
        if sentence_count < chunk_size:
            current_chunk += sentence + " "
            sentence_count += 1
        else:
            chunks.append(current_chunk.strip())
            current_chunk = sentence + " "
            sentence_count = 1
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks
# Example
document = ("This is the first sentence. This is the second sentence. "
            "This is the third sentence. This is the fourth sentence. "
            "This is the fifth sentence.")
chunk_size = 2
chunks = sentence_based_chunking(document, chunk_size)
print(chunks)
# Expected Output: ['This is the first sentence. This is the second sentence.',
# 'This is the third sentence. This is the fourth sentence.',
# 'This is the fifth sentence.']
This method is a significant improvement, but the resulting chunks can still be too long for a fixed token budget. Note that nltk.sent_tokenize requires nltk to be installed (pip install nltk) and its sentence tokenizer data to be downloaded (nltk.download('punkt')).
3. Recursive Chunking: Handling Long Documents
For extremely long documents, a recursive chunking approach can be beneficial. This involves splitting the document at sentence boundaries and packing sentences into chunks that stay under a maximum size.
def recursive_chunking(text, max_chunk_size, sentence_separator="."):
    """Packs sentences into chunks that don't exceed max_chunk_size.

    Args:
        text: The text to chunk.
        max_chunk_size: The maximum number of tokens per chunk.
        sentence_separator: Separator used to split sentences.

    Returns:
        A list of strings, where each string is a chunk.
    """
    # Filter out empty fragments (e.g. after a trailing separator).
    sentences = [s for s in text.split(sentence_separator) if s.strip()]
    chunks = []
    current_chunk = ""
    for sentence in sentences:
        if len(current_chunk.split()) + len(sentence.split()) <= max_chunk_size:
            current_chunk += sentence + sentence_separator
        else:
            chunks.append(current_chunk.strip())
            current_chunk = sentence + sentence_separator
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks
# Example
document = ("This is the first sentence. This is the second sentence. "
            "This is a very long sentence that spans multiple words and continues "
            "to elaborate on a complex idea. This is the fourth sentence.")
max_chunk_size = 20
chunks = recursive_chunking(document, max_chunk_size)
print(chunks)
# Expected Output: ['This is the first sentence. This is the second sentence.',
# 'This is a very long sentence that spans multiple words and continues to elaborate on a complex idea.',
# 'This is the fourth sentence.']
This strategy aims to balance sentence boundaries with the maximum chunk size, creating more manageable chunks.
Overlapping Chunks: Maintaining Context
To ensure that context isn't lost between chunks, it's crucial to use overlapping. This means that consecutive chunks share a portion of the original text. The amount of overlap depends on the specific application and the embedding model used.
def overlapping_chunking(text, chunk_size, overlap):
    """Splits text into overlapping fixed-size chunks.

    Args:
        text: The text to chunk.
        chunk_size: The maximum number of tokens per chunk.
        overlap: The number of tokens to overlap between chunks.

    Returns:
        A list of strings, where each string is a chunk.
    """
    # Guard against an infinite loop: the window must move forward.
    assert overlap < chunk_size, "overlap must be smaller than chunk_size"
    tokens = text.split()
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(" ".join(tokens[start:end]))
        start += chunk_size - overlap  # step forward, keeping `overlap` tokens
    return chunks
# Example
document = ("This is a sample document. It contains several sentences. "
            "We want to chunk it into smaller pieces.")
chunk_size = 10
overlap = 2
chunks = overlapping_chunking(document, chunk_size, overlap)
print(chunks)
# Expected Output: ['This is a sample document. It contains several sentences. We',
# 'sentences. We want to chunk it into smaller pieces.',
# 'pieces.']
Conclusion
Document chunking is a foundational step in the vector search pipeline. By carefully selecting a chunking strategy and incorporating overlapping, we can ensure that our vectors accurately represent the information contained within the original documents and maintain the necessary context for effective retrieval. Choosing the right approach requires considering the document's length, structure, and the specific requirements of the application. Remember to experiment and iterate to find the optimal chunking strategy for your use case. In the next post, we will explore how to combine these chunks with metadata to create a more robust and searchable knowledge base.
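As a closing sketch, the strategies above can be combined: sentence-based splitting, a token budget, and sentence-level overlap. The function below is illustrative only (the name, parameters, and defaults are mine, not from any library); it uses a simple regex sentence splitter and whitespace token counts in place of a real tokenizer.

```python
import re

def chunk_with_overlap(text, max_tokens=50, overlap_sentences=1):
    """Greedy sentence packing with sentence-level overlap.

    Sentences are packed into chunks of at most max_tokens (counted by
    whitespace splitting); each new chunk repeats the last
    `overlap_sentences` sentences of the previous chunk so that context
    carries across chunk boundaries.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    chunks, current = [], []
    for sentence in sentences:
        tokens_if_added = (sum(len(s.split()) for s in current)
                           + len(sentence.split()))
        if current and tokens_if_added > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:]  # carry overlap forward
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With `max_tokens=6` and one sentence of overlap, `chunk_with_overlap("One two three. Four five six. Seven eight nine.")` would repeat "Four five six." at the start of the second chunk, so a query matching that sentence can retrieve either neighbor.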