
    Chunking Strategies

    How you split your documents into chunks is one of the most important decisions in a RAG pipeline. Bad chunking leads to bad retrieval, and bad retrieval leads to bad answers. Let's explore the strategies and learn when to use each one.

    Why Chunking Matters

    Consider a 50-page company handbook. You cannot embed the entire thing as one vector — it would be too large and too vague. You also cannot embed each word individually — that loses all context. Chunking is about finding the right granularity.

    The impact of chunk size:

    Chunk Size                   | Pros                                  | Cons
    Too small (50-100 chars)     | Very precise matching                 | Loses context, fragments sentences
    Too large (5000+ chars)      | Full context preserved                | Vague matching, may exceed model limits
    Just right (200-1000 chars)  | Good balance of precision and context | Requires tuning per document type

    Strategy 1: Fixed-Size Chunking

    The simplest approach — split text every N characters (or tokens) with optional overlap.

    def fixed_size_chunk(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
        """Split text into fixed-size chunks with overlap."""
        if overlap >= chunk_size:
            raise ValueError("overlap must be smaller than chunk_size")
        chunks = []
        start = 0
        while start < len(text):
            end = min(start + chunk_size, len(text))
            chunks.append(text[start:end])
            start += chunk_size - overlap  # step back by `overlap` so chunks share context
        return chunks
    
    # Example
    chunks = fixed_size_chunk(document_text, chunk_size=500, overlap=50)

    When to use: Quick prototyping, uniform text (like chat logs), when you need simplicity.

    Drawbacks: Cuts sentences and paragraphs in the middle. A sentence about vacation policy might end up split across two chunks, making neither chunk useful on its own.
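    The boundary problem is easy to demonstrate with the function above (redefined here so the snippet runs standalone; the policy sentence is invented for illustration):

```python
def fixed_size_chunk(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Same fixed-size splitter as above."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

text = "Employees receive 15 days of paid vacation per year. Unused days roll over."
chunks = fixed_size_chunk(text, chunk_size=40, overlap=0)
# Neither chunk contains the complete vacation-policy sentence
print(chunks[0])
print(chunks[1])
```

    With a deliberately small chunk size and no overlap, the sentence is cut mid-word, so a query about vacation policy matches neither chunk well.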

    Strategy 2: Recursive Text Splitting

    Splits on natural boundaries — paragraphs first, then sentences, then words. This keeps content semantically coherent.

    from langchain_text_splitters import RecursiveCharacterTextSplitter
    
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,
        separators=[
            "\n\n",    # First try: split on double newlines (paragraphs)
            "\n",      # Then: single newlines
            ". ",      # Then: sentences
            ", ",      # Then: clauses
            " ",       # Then: words
            ""         # Finally: characters
        ]
    )
    
    chunks = splitter.split_text(document_text)

    The splitter tries each separator in order. If splitting on paragraphs produces chunks within the size limit, it stops there. If a paragraph is too large, it falls back to splitting on sentences, and so on.
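    That fallback logic can be sketched in a few lines of plain Python. This is a simplified illustration, not the actual LangChain implementation (which, among other things, merges small adjacent pieces back together up to the size limit and keeps separators):

```python
def recursive_split(text, separators, chunk_size):
    """Try the coarsest separator first; recurse into any piece
    that is still larger than chunk_size."""
    if len(text) <= chunk_size:
        return [text]
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut at chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            if piece:
                chunks.append(piece)
        else:
            # Piece too large: fall back to the next, finer separator
            chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks

doc = ("First paragraph.\n\n"
       "A much longer second paragraph that needs further splitting. "
       "It has several sentences. Each one is short.")
chunks = recursive_split(doc, ["\n\n", ". ", " ", ""], chunk_size=60)
```

    Here the short first paragraph survives intact, while the oversized second paragraph falls through to sentence-level splitting.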

    When to use: Most use cases. This is the default recommendation for general text documents.

    Strategy 3: Semantic Chunking

    Instead of splitting at a fixed size, split based on meaning: when the topic changes, start a new chunk.

    from langchain_experimental.text_splitter import SemanticChunker
    from langchain_openai import OpenAIEmbeddings
    
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    
    semantic_splitter = SemanticChunker(
        embeddings,
        breakpoint_threshold_type="percentile",
        breakpoint_threshold_amount=75
    )
    
    chunks = semantic_splitter.split_text(document_text)

    How it works:

    1. Split the text into sentences
    2. Embed each sentence
    3. Compare consecutive sentence embeddings
    4. When similarity drops below a threshold, start a new chunk
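    The four steps above can be sketched directly. This is a toy illustration: the `embed` function here is a stand-in word-presence vector, not a real embedding model or any library API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def semantic_chunks(sentences, embed, threshold=0.5):
    """Group consecutive sentences; start a new chunk when similarity
    between neighboring sentence embeddings drops below the threshold."""
    chunks = [[sentences[0]]]
    prev_vec = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine(prev_vec, vec) < threshold:
            chunks.append([sent])      # topic shift: new chunk
        else:
            chunks[-1].append(sent)    # same topic: extend current chunk
        prev_vec = vec
    return [" ".join(c) for c in chunks]

# Toy "embedding": presence/absence of words from a tiny vocabulary.
# A real pipeline would call an embedding model here instead.
VOCAB = ["vacation", "days", "unused", "office", "printer", "paper"]

def embed(sentence):
    words = sentence.lower().replace(".", "").split()
    return [1.0 if w in words else 0.0 for w in VOCAB]

sentences = [
    "Employees get 15 vacation days.",
    "Unused vacation days roll over.",
    "The office printer uses A4 paper.",
]
print(semantic_chunks(sentences, embed, threshold=0.1))
```

    The two vacation sentences share vocabulary and stay in one chunk; the printer sentence has no overlap with them, so the similarity drops and a new chunk begins.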

    When to use: Documents with multiple topics, transcripts, long-form articles where topics shift naturally.

    Drawbacks: Slower (requires embedding every sentence), chunk sizes are unpredictable.

    Strategy 4: Document-Structure-Based Chunking

    Use the document's own structure — headers, sections, pages — as chunk boundaries.

    import re
    
    def chunk_by_headers(markdown_text: str) -> list[dict]:
        """Split a Markdown document by headers."""
        # (?m) lets ^ and $ match at every line, so a header at the very
        # start of the document is caught too
        sections = re.split(r"(?m)^(#{1,3} .+)$", markdown_text)
    
        chunks = []
        current_header = "Introduction"
    
        for section in sections:
            if re.match(r"#{1,3} .+", section):
                current_header = section.strip("# ").strip()
            elif section.strip():
                chunks.append({
                    "content": section.strip(),
                    "header": current_header
                })
    
        return chunks
    
    # For HTML documents
    from langchain_text_splitters import HTMLSectionSplitter
    
    html_splitter = HTMLSectionSplitter(
        headers_to_split_on=[
            ("h1", "Header 1"),
            ("h2", "Header 2"),
            ("h3", "Header 3"),
        ]
    )
    chunks = html_splitter.split_text(html_content)

    When to use: Structured documents like docs sites, wikis, markdown files, HTML pages.

    Overlap: Why It Matters

    Overlap means the end of one chunk appears at the start of the next chunk. This prevents information loss at chunk boundaries.

    Without overlap (information lost at boundary):
      Chunk 1: "...employees receive 15 days of paid"
      Chunk 2: "vacation per year. Unused days can be..."
    
    With 50-char overlap (boundary information preserved):
      Chunk 1: "...employees receive 15 days of paid vacation per year."
      Chunk 2: "15 days of paid vacation per year. Unused days can be..."
    

    A good rule of thumb: 10-20% overlap relative to chunk size.

    Chunk Size   | Recommended Overlap
    200 chars    | 20-40 chars
    500 chars    | 50-100 chars
    1000 chars   | 100-200 chars

    Chunk Size Considerations

    The right chunk size depends on your use case:

    # Experimentation template — try different sizes and measure retrieval quality
    chunk_sizes = [200, 500, 800, 1200]
    
    for size in chunk_sizes:
        splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=size // 10)
        chunks = splitter.split_text(document_text)
        print(f"Chunk size {size}: {len(chunks)} chunks, avg {sum(len(c) for c in chunks) // len(chunks)} chars")

    Guidelines:

    Document Type        | Recommended Chunk Size | Why
    FAQ / Short answers  | 200-300 chars          | Each Q&A is self-contained
    Technical docs       | 500-800 chars          | Need enough context for code/explanations
    Legal documents      | 800-1200 chars         | Clauses reference surrounding context
    Chat logs            | 300-500 chars          | Conversations are relatively short
    Books / articles     | 500-1000 chars         | Paragraphs are natural units

    Best Practices

    1. Always include metadata — Store the source file, page number, section header, and chunk index with each chunk
    2. Preserve context — Add the section header or document title to each chunk so it makes sense in isolation
    3. Test with real queries — The best chunk size is the one that retrieves the right information for your actual questions
    4. Use token-based splitting for LLMs — Characters and tokens are not the same. "Hello world" is 2 tokens but 11 characters. Use tiktoken for accurate token counting
    5. Consider your embedding model's limit — Most models handle 512 tokens well; some support up to 8192
    import tiktoken
    
    encoder = tiktoken.encoding_for_model("text-embedding-3-small")
    
    def count_tokens(text: str) -> int:
        return len(encoder.encode(text))
    
    # Ensure chunks are within token limits
    for chunk in chunks:
        tokens = count_tokens(chunk)
        if tokens > 512:
            print(f"Warning: chunk has {tokens} tokens (recommended max: 512)")
    6. Enrich chunks with context — Prepend parent information so each chunk stands alone:
    enriched_chunks = []
    for chunk in chunks:
        enriched = f"Document: {doc_title}\nSection: {chunk['header']}\n\n{chunk['content']}"
        enriched_chunks.append(enriched)

    What to ask your AI: "I have [document type] documents averaging [size]. Recommend the best chunking strategy, chunk size, and overlap for my RAG pipeline."

    What's Next?

    Your documents are chunked and indexed. But how do you know if your RAG system is actually returning good results? Next, we cover evaluating RAG quality — metrics, testing strategies, and common failure modes.

