You've parsed your documents into text. But embedding entire documents (turning the whole text into one list-of-numbers vector, as in Day 19) isn't effective — one vector can't represent a 10,000-word doc that spans many topics. The solution? Chunking - splitting text into smaller, meaningful pieces.
Coming from Software Engineering? Text chunking is like pagination or sharding — you're breaking large data into smaller, manageable pieces. If you've implemented paginated APIs or database sharding strategies, the tradeoffs are similar: chunk too small and you lose context, chunk too large and you waste resources.
Why Chunk?
Chunking Benefits
| Aspect | Without Chunks | With Chunks |
|---|---|---|
| Search precision | Low | High |
| Context relevance | Mixed | Focused |
| Token usage | High | Optimal |
| Cost | Higher | Lower |
Basic Chunking: Fixed Size
The simplest approach - split every N characters:
# script_id: day_025_text_chunking/fixed_size_chunking
def chunk_by_characters(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
"""Split text into fixed-size chunks with overlap."""
chunks = []
start = 0
while start < len(text):
end = start + chunk_size
chunk = text[start:end]
chunks.append(chunk)
start = end - overlap # Overlap to maintain context
return chunks
# Usage
text = "Your very long document text here..." * 100
chunks = chunk_by_characters(text, chunk_size=500, overlap=50)
print(f"Created {len(chunks)} chunks")
Overlap means each chunk repeats the last N characters of the previous one. Without it, a sentence split across a boundary is cut in half in both chunks, and a query matching that sentence may miss both. With overlap, the straddling text lives fully in at least one chunk.
Problems with Fixed Size
Fixed size can split mid-sentence or mid-word!
Smart Chunking: Recursive Character Splitter
Split at natural boundaries (paragraphs, sentences, words):
# script_id: day_025_text_chunking/recursive_character_splitter
import re
def recursive_chunk(
text: str,
chunk_size: int = 1000,
overlap: int = 200,
separators: list[str] = None
) -> list[str]:
"""
Recursively split text using multiple separators.
Tries to split at paragraph > sentence > word boundaries.
"""
if separators is None:
separators = ["\n\n", "\n", ". ", " ", ""]
# Note: overlap is accepted for API symmetry but not applied in this
# recursive splitter; see the fixed-size chunker above for an overlap implementation.
chunks = []
current_sep = separators[0]
remaining_seps = separators[1:]
# Base case: text fits in chunk
if len(text) <= chunk_size:
return [text.strip()] if text.strip() else []
# Try to split with current separator
splits = text.split(current_sep)
current_chunk = ""
for split in splits:
test_chunk = current_chunk + current_sep + split if current_chunk else split
if len(test_chunk) <= chunk_size:
current_chunk = test_chunk
else:
# Save current chunk if valid
if current_chunk:
chunks.append(current_chunk.strip())
# Handle split that's too large
if len(split) > chunk_size and remaining_seps:
# Recursively split with smaller separator
sub_chunks = recursive_chunk(split, chunk_size, overlap, remaining_seps)
chunks.extend(sub_chunks)
current_chunk = ""
else:
current_chunk = split
# Don't forget the last chunk
if current_chunk:
chunks.append(current_chunk.strip())
return [c for c in chunks if c] # Remove empty chunks
# Usage
document = """
# Introduction
This is the first paragraph about machine learning.
It contains important information about AI.
# Methods
The second section describes our methodology.
We used various techniques including deep learning.
# Results
Our results show significant improvements.
The accuracy increased by 15% compared to baseline.
"""
chunks = recursive_chunk(document, chunk_size=200, overlap=50)
for i, chunk in enumerate(chunks):
print(f"\n--- Chunk {i+1} ({len(chunk)} chars) ---")
print(chunk[:100] + "..." if len(chunk) > 100 else chunk)
Semantic Chunking
Split based on meaning, not just characters:
The idea: turn each sentence into an embedding (Day 19), then walk the sentences comparing each to the one before it using cosine similarity (Day 20). When two neighbors point in very DIFFERENT directions — low similarity — the topic likely just changed, so we start a new chunk there.
similarity_threshold is the cutoff for "still the same topic": higher (e.g. 0.85) splits more aggressively into many small chunks; lower (e.g. 0.7) keeps more sentences together. Start around 0.75-0.8 and tune by eyeballing the output.
# script_id: day_025_text_chunking/semantic_chunking
import re
from openai import OpenAI
import numpy as np
client = OpenAI()
def semantic_chunk(
text: str,
max_chunk_size: int = 1000,
similarity_threshold: float = 0.8
) -> list[str]:
"""
Split text into semantically coherent chunks.
Splits when content topic changes significantly.
"""
# First split into sentences
sentences = re.split(r'(?<=[.!?])\s+', text)
if len(sentences) <= 1:
return [text] if text.strip() else []
# Get embeddings for all sentences
response = client.embeddings.create(
model="text-embedding-3-small",
input=sentences
)
embeddings = [d.embedding for d in sorted(response.data, key=lambda x: x.index)]
# Find semantic breakpoints
chunks = []
current_chunk = [sentences[0]]
for i in range(1, len(sentences)):
# Compare with previous sentence
# cosine similarity (Day 20): how aligned these two sentence-vectors are
# (1.0 = same direction/same topic, near 0 = unrelated).
similarity = np.dot(embeddings[i], embeddings[i-1]) / (
np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i-1])
)
current_text = " ".join(current_chunk)
# Start new chunk if:
# 1. Topic changed (low similarity) OR
# 2. Current chunk is too large
if similarity < similarity_threshold or len(current_text) > max_chunk_size:
if current_chunk:
chunks.append(" ".join(current_chunk))
current_chunk = [sentences[i]]
else:
current_chunk.append(sentences[i])
# Add last chunk
if current_chunk:
chunks.append(" ".join(current_chunk))
return chunks
# Usage
text = """
Machine learning is a subset of AI. It uses algorithms to learn from data.
Deep learning uses neural networks with many layers.
The weather today is sunny. It's a great day for a picnic.
We should pack sandwiches and drinks.
Back to technology, Python is popular for ML.
It has many libraries like TensorFlow and PyTorch.
"""
chunks = semantic_chunk(text, similarity_threshold=0.75)
for i, chunk in enumerate(chunks):
print(f"\n--- Semantic Chunk {i+1} ---")
print(chunk)
Markdown-Aware Chunking
Respect document structure:
# script_id: day_025_text_chunking/markdown_aware_chunking
import re
from dataclasses import dataclass
@dataclass
class Chunk:
content: str
metadata: dict
def chunk_markdown(
text: str,
chunk_size: int = 1000
) -> list[Chunk]:
"""Chunk markdown while preserving headers and structure."""
# Split by headers
pattern = r'^(#{1,6})\s+(.+)$'
lines = text.split('\n')
chunks = []
current_chunk = []
current_headers = {}
for line in lines:
header_match = re.match(pattern, line)
if header_match:
# Save current chunk if exists
if current_chunk:
content = '\n'.join(current_chunk)
if content.strip():
chunks.append(Chunk(
content=content.strip(),
metadata=current_headers.copy()
))
current_chunk = []
# Update header context
level = len(header_match.group(1))
header_text = header_match.group(2)
# Clear lower level headers
current_headers = {k: v for k, v in current_headers.items() if k < level}
current_headers[level] = header_text
current_chunk.append(line)
# Check chunk size
current_text = '\n'.join(current_chunk)
if len(current_text) > chunk_size:
# Split current chunk
if len(current_chunk) > 1:
# Keep last line for next chunk
content = '\n'.join(current_chunk[:-1])
if content.strip():
chunks.append(Chunk(
content=content.strip(),
metadata=current_headers.copy()
))
current_chunk = [current_chunk[-1]]
# Don't forget last chunk
if current_chunk:
content = '\n'.join(current_chunk)
if content.strip():
chunks.append(Chunk(
content=content.strip(),
metadata=current_headers.copy()
))
return chunks
# Usage
markdown_doc = """
# Chapter 1: Introduction
This is the introduction section.
It provides an overview of the topic.
## 1.1 Background
The background section explains the history.
Many researchers have studied this area.
## 1.2 Motivation
Why is this work important?
There are several key reasons.
# Chapter 2: Methods
Our methodology is described here.
## 2.1 Data Collection
We collected data from various sources.
"""
chunks = chunk_markdown(markdown_doc, chunk_size=200)
for i, chunk in enumerate(chunks):
print(f"\n--- Chunk {i+1} ---")
print(f"Headers: {chunk.metadata}")
print(f"Content: {chunk.content[:100]}...")
Choosing Chunk Size
Rough rule of thumb: ~4 characters per token in English, so e.g. 500 tokens ≈ 2000 characters. The table below uses characters to match the code.
Chunk Size Guidelines
| Use Case | Chunk Size | Overlap | Reasoning |
|---|---|---|---|
| Precise Q&A | 200-400 chars | 50 | Specific retrieval |
| General search | 500-800 chars | 100 | Balance precision/context |
| Summarization | 1000-1500 chars | 200 | Need full context |
| Code | 50-100 lines | 10 lines | Preserve functions |
Complete Chunking Pipeline
# script_id: day_025_text_chunking/complete_chunking_pipeline
from dataclasses import dataclass
from typing import List, Optional
import re
@dataclass
class DocumentChunk:
content: str
source: str
chunk_index: int
total_chunks: int
metadata: dict
class DocumentChunker:
"""Flexible document chunking with multiple strategies."""
def __init__(
self,
chunk_size: int = 500,
overlap: int = 50,
strategy: str = "recursive" # "fixed", "recursive", "sentence"
):
self.chunk_size = chunk_size
self.overlap = overlap
self.strategy = strategy
def chunk(self, text: str, source: str = "unknown", metadata: dict = None) -> List[DocumentChunk]:
"""Chunk a document using the configured strategy."""
if self.strategy == "fixed":
raw_chunks = self._fixed_chunk(text)
elif self.strategy == "recursive":
raw_chunks = self._recursive_chunk(text)
elif self.strategy == "sentence":
raw_chunks = self._sentence_chunk(text)
else:
raise ValueError(f"Unknown strategy: {self.strategy}")
# Wrap in DocumentChunk objects
return [
DocumentChunk(
content=chunk,
source=source,
chunk_index=i,
total_chunks=len(raw_chunks),
metadata=metadata or {}
)
for i, chunk in enumerate(raw_chunks)
]
def _fixed_chunk(self, text: str) -> List[str]:
"""Simple fixed-size chunking."""
chunks = []
start = 0
while start < len(text):
end = start + self.chunk_size
chunks.append(text[start:end])
start = end - self.overlap
return [c.strip() for c in chunks if c.strip()]
def _recursive_chunk(self, text: str) -> List[str]:
"""Recursive chunking with smart separators."""
separators = ["\n\n", "\n", ". ", ", ", " "]
return self._split_recursive(text, separators)
def _split_recursive(self, text: str, separators: List[str]) -> List[str]:
if len(text) <= self.chunk_size:
return [text] if text.strip() else []
if not separators:
# Last resort: hard split
return self._fixed_chunk(text)
sep = separators[0]
splits = text.split(sep)
chunks = []
current = ""
for split in splits:
test = (current + sep + split) if current else split
if len(test) <= self.chunk_size:
current = test
else:
if current:
chunks.append(current)
if len(split) > self.chunk_size:
chunks.extend(self._split_recursive(split, separators[1:]))
current = ""
else:
current = split
if current:
chunks.append(current)
return [c.strip() for c in chunks if c.strip()]
def _sentence_chunk(self, text: str) -> List[str]:
"""Sentence-based chunking."""
sentences = re.split(r'(?<=[.!?])\s+', text)
chunks = []
current = ""
for sentence in sentences:
test = (current + " " + sentence) if current else sentence
if len(test) <= self.chunk_size:
current = test
else:
if current:
chunks.append(current)
current = sentence
if current:
chunks.append(current)
return [c.strip() for c in chunks if c.strip()]
# Usage
chunker = DocumentChunker(chunk_size=300, overlap=50, strategy="recursive")
document = """
Machine learning is transforming how we build software.
Traditional programming requires explicit rules. Machine learning learns from data instead.
There are three main types of machine learning:
1. Supervised learning uses labeled data
2. Unsupervised learning finds patterns in unlabeled data
3. Reinforcement learning learns through trial and error
"""
chunks = chunker.chunk(document, source="ml_intro.txt", metadata={"topic": "ml"})
print(f"Created {len(chunks)} chunks:\n")
for chunk in chunks:
print(f"--- Chunk {chunk.chunk_index + 1}/{chunk.total_chunks} ---")
print(f"Content: {chunk.content}")
print()
Summary
Quick Reference
# script_id: day_025_text_chunking/quick_reference
# Simple fixed-size chunks
def chunk_simple(text, size=500):
return [text[i:i+size] for i in range(0, len(text), size)]
# Sentence-aware chunks
import re
sentences = re.split(r'(?<=[.!?])\s+', text)
# Paragraph-aware chunks
paragraphs = text.split('\n\n')
Exercises
- Add overlap to your fixed-size chunker. Extend
chunk_simpleto accept anoverlapparameter so consecutive chunks share the last N characters. Verify a sentence that straddles a boundary now appears in both chunks. - Measure the size/recall tradeoff. Chunk the same document at 200, 500, and 1000 characters, embed each set, and run one query against all three. Note how chunk size changes which passage ranks first.
- Write a structure-aware splitter. Split a Markdown doc on headings (
#,##) so each chunk is one section, then fall back to size-based splitting only for sections that exceed your limit. - Attach metadata to every chunk. Return chunks as
{"text": ..., "source": ..., "chunk_index": ...}so you can trace a retrieved chunk back to its document and position.
Solutions (approaches)
- Step the range by
size - overlapinstead ofsize:range(0, len(text), size - overlap). Guard againstoverlap >= size. - Build three collections (or tag chunks with their size), embed once per set, and compare top-1 results. Smaller chunks = sharper matches but less surrounding context.
- Split on a heading regex, then post-process: any section longer than the limit goes through your recursive/fixed splitter. Keep the heading text as a prefix for context.
- Have the chunker
enumerate(chunks)and emit dicts; downstreamcollection.addpasses the extra fields viametadatas=.
Checkpoint
Run DocumentChunker(strategy="fixed", overlap=50) on a long document and confirm: you get multiple chunks, each carries its chunk_index/total_chunks, and consecutive chunks share the overlap you configured (the tail of one reappears at the head of the next). If chunks have no overlap, check that your overlap parameter is actually being subtracted from the stride — a stride equal to chunk size means zero overlap and lost context at the seams.
What's Next?
Now that you can chunk documents effectively, let's learn how to Inject Retrieved Context into Prompts - the final piece of the RAG puzzle!