You've parsed your documents into text. But embedding entire documents isn't effective - they're too long and contain mixed topics. The solution? Chunking - splitting text into smaller, meaningful pieces.
Coming from Software Engineering? Text chunking is like pagination or sharding — you're breaking large data into smaller, manageable pieces. If you've implemented paginated APIs or database sharding strategies, the tradeoffs are similar: chunk too small and you lose context, chunk too large and you waste resources.
Why Chunk?
Chunking Benefits
| Aspect | Without Chunks | With Chunks |
|---|---|---|
| Search precision | Low | High |
| Context relevance | Mixed | Focused |
| Token usage | High | Optimal |
| Cost | Higher | Lower |
Basic Chunking: Fixed Size
The simplest approach - split every N characters:
# script_id: day_025_text_chunking/fixed_size_chunking
def chunk_by_characters(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks with overlap.

    Args:
        text: The text to split.
        chunk_size: Maximum number of characters per chunk.
        overlap: Characters shared between consecutive chunks so that
            context spanning a boundary is not lost.

    Returns:
        List of chunks; the final chunk may be shorter than chunk_size.
        An empty string yields an empty list.

    Raises:
        ValueError: If chunk_size <= 0, or overlap is not in [0, chunk_size).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # overlap >= chunk_size would make `start` stop advancing -> infinite loop
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        start = end - overlap  # Overlap to maintain context
    return chunks
# Usage
# Repeat a short sentence to simulate a long document, then chunk it.
text = "Your very long document text here..." * 100
chunks = chunk_by_characters(text, chunk_size=500, overlap=50)
print(f"Created {len(chunks)} chunks")
Problems with Fixed Size
Fixed size can split mid-sentence or mid-word!
Smart Chunking: Recursive Character Splitter
Split at natural boundaries (paragraphs, sentences, words):
# script_id: day_025_text_chunking/recursive_character_splitter
import re
def recursive_chunk(
    text: str,
    chunk_size: int = 1000,
    overlap: int = 200,
    separators: list[str] = None
) -> list[str]:
    """
    Recursively split text using multiple separators.

    Tries to split at paragraph > sentence > word boundaries, falling back
    to a hard character-level split only when a single fragment contains no
    usable separator and still exceeds chunk_size.

    Args:
        text: The text to split.
        chunk_size: Maximum characters per chunk.
        overlap: Accepted for API compatibility. NOTE(review): overlap is
            currently never applied between chunks — confirm intent.
        separators: Ordered separators to try, coarsest first. "" means
            character level. Defaults to paragraph, line, sentence, word,
            then character.

    Returns:
        List of non-empty, stripped chunks.
    """
    if separators is None:
        separators = ["\n\n", "\n", ". ", " ", ""]
    # Base case: text fits in a single chunk
    if len(text) <= chunk_size:
        return [text.strip()] if text.strip() else []
    current_sep = separators[0] if separators else ""
    remaining_seps = separators[1:] if separators else []
    # "" denotes character-level splitting. str.split("") raises ValueError,
    # so handle it as a hard fixed-size split instead of crashing.
    if current_sep == "":
        pieces = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
        return [p.strip() for p in pieces if p.strip()]
    chunks = []
    splits = text.split(current_sep)
    current_chunk = ""
    for split in splits:
        test_chunk = current_chunk + current_sep + split if current_chunk else split
        if len(test_chunk) <= chunk_size:
            current_chunk = test_chunk
        else:
            # Flush the accumulated chunk before handling this fragment
            if current_chunk:
                chunks.append(current_chunk.strip())
            if len(split) > chunk_size:
                # Fragment alone is too large: recurse with finer separators,
                # or hard-split when none remain, rather than emitting an
                # oversized chunk.
                sub_seps = remaining_seps if remaining_seps else [""]
                chunks.extend(recursive_chunk(split, chunk_size, overlap, sub_seps))
                current_chunk = ""
            else:
                current_chunk = split
    # Don't forget the last chunk
    if current_chunk:
        chunks.append(current_chunk.strip())
    return [c for c in chunks if c]  # Remove empty chunks
# Usage
# Sample document with markdown-style section markers and short paragraphs.
document = """
# Introduction
This is the first paragraph about machine learning.
It contains important information about AI.
# Methods
The second section describes our methodology.
We used various techniques including deep learning.
# Results
Our results show significant improvements.
The accuracy increased by 15% compared to baseline.
"""
chunks = recursive_chunk(document, chunk_size=200, overlap=50)
for i, chunk in enumerate(chunks):
    print(f"\n--- Chunk {i+1} ({len(chunk)} chars) ---")
    # Show at most a 100-char preview per chunk
    print(chunk[:100] + "..." if len(chunk) > 100 else chunk)
Semantic Chunking
Split based on meaning, not just characters. Note that this approach embeds every sentence via an API call, so it adds cost and latency compared to the purely local strategies above:
# script_id: day_025_text_chunking/semantic_chunking
from openai import OpenAI
import numpy as np
# Module-level OpenAI client used for embedding calls.
# NOTE(review): presumably picks up OPENAI_API_KEY from the environment
# per the SDK default — confirm deployment configuration.
client = OpenAI()
def semantic_chunk(
    text: str,
    max_chunk_size: int = 1000,
    similarity_threshold: float = 0.8
) -> list[str]:
    """
    Split text into semantically coherent chunks.

    Embeds every sentence (one batched API call), then starts a new chunk
    whenever cosine similarity between adjacent sentences drops below
    `similarity_threshold` (topic change) or the current chunk has grown
    past `max_chunk_size`.

    Args:
        text: Text to split.
        max_chunk_size: Soft character cap; checked before the next sentence
            is appended, so a chunk may exceed it by one sentence.
        similarity_threshold: Cosine similarity below which a boundary starts.

    Returns:
        List of chunks, each one or more sentences joined by spaces.

    Note:
        Performs a network call to the OpenAI embeddings API via the
        module-level `client`.
    """
    import re  # Local import so this standalone snippet runs without extra setup

    # First split into sentences at terminal punctuation followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text)
    if len(sentences) <= 1:
        return [text] if text.strip() else []
    # Get embeddings for all sentences in a single batched request
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=sentences
    )
    # Sort by index so embeddings line up with `sentences` positionally
    embeddings = [d.embedding for d in sorted(response.data, key=lambda x: x.index)]
    # Find semantic breakpoints
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        # Cosine similarity with the previous sentence
        similarity = np.dot(embeddings[i], embeddings[i-1]) / (
            np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i-1])
        )
        current_text = " ".join(current_chunk)
        # Start new chunk if:
        # 1. Topic changed (low similarity) OR
        # 2. Current chunk is too large
        if similarity < similarity_threshold or len(current_text) > max_chunk_size:
            if current_chunk:
                chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    # Add last chunk
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
# Usage
# Deliberately mixed-topic text (ML -> weather -> back to ML): the topic
# shifts are where semantic boundaries should appear.
text = """
Machine learning is a subset of AI. It uses algorithms to learn from data.
Deep learning uses neural networks with many layers.
The weather today is sunny. It's a great day for a picnic.
We should pack sandwiches and drinks.
Back to technology, Python is popular for ML.
It has many libraries like TensorFlow and PyTorch.
"""
chunks = semantic_chunk(text, similarity_threshold=0.75)
for i, chunk in enumerate(chunks):
    print(f"\n--- Semantic Chunk {i+1} ---")
    print(chunk)
Markdown-Aware Chunking
Respect document structure:
# script_id: day_025_text_chunking/markdown_aware_chunking
import re
from dataclasses import dataclass
@dataclass
class Chunk:
    """A chunk of markdown text together with the header context it sits under."""
    content: str  # the chunk text itself (stripped)
    metadata: dict  # header level (int) -> header text, for headers in scope
def chunk_markdown(
    text: str,
    chunk_size: int = 1000
) -> list[Chunk]:
    """Chunk markdown while preserving headers and structure.

    Walks the document line by line, flushing the accumulated chunk at each
    ATX header (`#` .. `######`) and whenever the accumulated text exceeds
    chunk_size. Every emitted Chunk carries the header context (level ->
    header text) that was in scope where it was cut.

    Args:
        text: Markdown source text.
        chunk_size: Size threshold in characters that triggers a flush.
            NOTE(review): a single line longer than chunk_size is not split,
            so chunks can exceed this limit.

    Returns:
        List of Chunk objects in document order; whitespace-only chunks
        are dropped.
    """
    # Split by headers
    pattern = r'^(#{1,6})\s+(.+)$'
    lines = text.split('\n')
    chunks = []
    current_chunk = []       # lines accumulated for the chunk being built
    current_headers = {}     # header level -> text, for headers currently in scope
    for line in lines:
        header_match = re.match(pattern, line)
        if header_match:
            # Save current chunk if exists
            # (flushed with the OLD header context — the new header belongs
            # to the NEXT chunk)
            if current_chunk:
                content = '\n'.join(current_chunk)
                if content.strip():
                    chunks.append(Chunk(
                        content=content.strip(),
                        metadata=current_headers.copy()
                    ))
                current_chunk = []
            # Update header context
            level = len(header_match.group(1))
            header_text = header_match.group(2)
            # Clear lower level headers
            # (a new H2 invalidates any previous H2/H3/..., but keeps the H1)
            current_headers = {k: v for k, v in current_headers.items() if k < level}
            current_headers[level] = header_text
        # Header lines are included in the chunk body too
        current_chunk.append(line)
        # Check chunk size
        current_text = '\n'.join(current_chunk)
        if len(current_text) > chunk_size:
            # Split current chunk
            if len(current_chunk) > 1:
                # Keep last line for next chunk
                content = '\n'.join(current_chunk[:-1])
                if content.strip():
                    chunks.append(Chunk(
                        content=content.strip(),
                        metadata=current_headers.copy()
                    ))
                current_chunk = [current_chunk[-1]]
    # Don't forget last chunk
    if current_chunk:
        content = '\n'.join(current_chunk)
        if content.strip():
            chunks.append(Chunk(
                content=content.strip(),
                metadata=current_headers.copy()
            ))
    return chunks
# Usage
# Nested-header document: chunks under "## 1.1 Background" should carry
# both the chapter (level 1) and section (level 2) headers as metadata.
markdown_doc = """
# Chapter 1: Introduction
This is the introduction section.
It provides an overview of the topic.
## 1.1 Background
The background section explains the history.
Many researchers have studied this area.
## 1.2 Motivation
Why is this work important?
There are several key reasons.
# Chapter 2: Methods
Our methodology is described here.
## 2.1 Data Collection
We collected data from various sources.
"""
chunks = chunk_markdown(markdown_doc, chunk_size=200)
for i, chunk in enumerate(chunks):
    print(f"\n--- Chunk {i+1} ---")
    print(f"Headers: {chunk.metadata}")
    print(f"Content: {chunk.content[:100]}...")
Choosing Chunk Size
Chunk Size Guidelines
| Use Case | Chunk Size | Overlap | Reasoning |
|---|---|---|---|
| Precise Q&A | 200-400 chars | 50 | Specific retrieval |
| General search | 500-800 chars | 100 | Balance precision/context |
| Summarization | 1000-1500 chars | 200 | Need full context |
| Code | 50-100 lines | 10 lines | Preserve functions |
Complete Chunking Pipeline
# script_id: day_025_text_chunking/complete_chunking_pipeline
from dataclasses import dataclass
from typing import List, Optional
import re
@dataclass
class DocumentChunk:
content: str
source: str
chunk_index: int
total_chunks: int
metadata: dict
class DocumentChunker:
"""Flexible document chunking with multiple strategies."""
def __init__(
self,
chunk_size: int = 500,
overlap: int = 50,
strategy: str = "recursive" # "fixed", "recursive", "sentence"
):
self.chunk_size = chunk_size
self.overlap = overlap
self.strategy = strategy
def chunk(self, text: str, source: str = "unknown", metadata: dict = None) -> List[DocumentChunk]:
"""Chunk a document using the configured strategy."""
if self.strategy == "fixed":
raw_chunks = self._fixed_chunk(text)
elif self.strategy == "recursive":
raw_chunks = self._recursive_chunk(text)
elif self.strategy == "sentence":
raw_chunks = self._sentence_chunk(text)
else:
raise ValueError(f"Unknown strategy: {self.strategy}")
# Wrap in DocumentChunk objects
return [
DocumentChunk(
content=chunk,
source=source,
chunk_index=i,
total_chunks=len(raw_chunks),
metadata=metadata or {}
)
for i, chunk in enumerate(raw_chunks)
]
def _fixed_chunk(self, text: str) -> List[str]:
"""Simple fixed-size chunking."""
chunks = []
start = 0
while start < len(text):
end = start + self.chunk_size
chunks.append(text[start:end])
start = end - self.overlap
return [c.strip() for c in chunks if c.strip()]
def _recursive_chunk(self, text: str) -> List[str]:
"""Recursive chunking with smart separators."""
separators = ["\n\n", "\n", ". ", ", ", " "]
return self._split_recursive(text, separators)
def _split_recursive(self, text: str, separators: List[str]) -> List[str]:
if len(text) <= self.chunk_size:
return [text] if text.strip() else []
if not separators:
# Last resort: hard split
return self._fixed_chunk(text)
sep = separators[0]
splits = text.split(sep)
chunks = []
current = ""
for split in splits:
test = (current + sep + split) if current else split
if len(test) <= self.chunk_size:
current = test
else:
if current:
chunks.append(current)
if len(split) > self.chunk_size:
chunks.extend(self._split_recursive(split, separators[1:]))
current = ""
else:
current = split
if current:
chunks.append(current)
return [c.strip() for c in chunks if c.strip()]
def _sentence_chunk(self, text: str) -> List[str]:
"""Sentence-based chunking."""
sentences = re.split(r'(?<=[.!?])\s+', text)
chunks = []
current = ""
for sentence in sentences:
test = (current + " " + sentence) if current else sentence
if len(test) <= self.chunk_size:
current = test
else:
if current:
chunks.append(current)
current = sentence
if current:
chunks.append(current)
return [c.strip() for c in chunks if c.strip()]
# Usage
# 300-char recursive chunks: paragraphs first, then finer boundaries.
chunker = DocumentChunker(chunk_size=300, overlap=50, strategy="recursive")
document = """
Machine learning is transforming how we build software.
Traditional programming requires explicit rules. Machine learning learns from data instead.
There are three main types of machine learning:
1. Supervised learning uses labeled data
2. Unsupervised learning finds patterns in unlabeled data
3. Reinforcement learning learns through trial and error
"""
chunks = chunker.chunk(document, source="ml_intro.txt", metadata={"topic": "ml"})
print(f"Created {len(chunks)} chunks:\n")
for chunk in chunks:
    print(f"--- Chunk {chunk.chunk_index + 1}/{chunk.total_chunks} ---")
    print(f"Content: {chunk.content}")
    print()
Summary
Quick Reference
# script_id: day_025_text_chunking/quick_reference
# Simple fixed-size chunks
def chunk_simple(text, size=500):
    """Split text into consecutive, non-overlapping slices of at most `size` chars."""
    offsets = range(0, len(text), size)
    return [text[start:start + size] for start in offsets]
# Sentence-aware chunks
import re
# NOTE(review): `text` is assumed to be defined earlier in the session —
# these are illustrative fragments, not a runnable script.
sentences = re.split(r'(?<=[.!?])\s+', text)
# Paragraph-aware chunks
paragraphs = text.split('\n\n')
What's Next?
Now that you can chunk documents effectively, let's learn how to Inject Retrieved Context into Prompts - the final piece of the RAG puzzle!