
    Chunking Strategies

    How you split your documents into chunks is one of the most important decisions in a RAG pipeline. Bad chunking leads to bad retrieval, and bad retrieval leads to bad answers. Let's explore the strategies and learn when to use each one.

    Why Chunking Matters

    Consider a 50-page company handbook. You cannot embed the entire thing as one vector — it would be too large and too vague. You also cannot embed each word individually — that loses all context. Chunking is about finding the right granularity.

    The impact of chunk size:

    Chunk Size                   | Pros                                  | Cons
    Too small (50-100 chars)     | Very precise matching                 | Loses context, fragments sentences
    Too large (5000+ chars)      | Full context preserved                | Vague matching, may exceed model limits
    Just right (200-1000 chars)  | Good balance of precision and context | Requires tuning per document type

    Strategy 1: Fixed-Size Chunking

    The simplest approach — split text every N characters (or tokens) with optional overlap.

    def fixed_size_chunk(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
        """Split text into fixed-size chunks with overlap."""
        if overlap >= chunk_size:
            raise ValueError("overlap must be smaller than chunk_size")
        chunks = []
        start = 0
        while start < len(text):
            end = min(start + chunk_size, len(text))
            chunks.append(text[start:end])
            start += chunk_size - overlap  # step back by `overlap` so chunks share context
        return chunks
    
    # Example
    chunks = fixed_size_chunk(document_text, chunk_size=500, overlap=50)

    When to use: Quick prototyping, uniform text (like chat logs), when you need simplicity.

    Drawbacks: Cuts sentences and paragraphs in the middle. A sentence about vacation policy might end up split across two chunks, making neither chunk useful on its own.
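    The boundary problem is easy to demonstrate with the function above (redefined here so the snippet runs standalone; the policy sentence is invented for illustration):

```python
def fixed_size_chunk(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Same fixed-size splitter as above."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

text = "Employees receive 15 days of paid vacation per year. Unused days roll over."
chunks = fixed_size_chunk(text, chunk_size=40, overlap=0)
# Neither chunk contains the complete vacation-policy sentence
print(chunks[0])
print(chunks[1])
```

    With a deliberately small chunk size and no overlap, the sentence is cut mid-word, so a query about vacation policy matches neither chunk well.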

    Strategy 2: Recursive Text Splitting

    Splits on natural boundaries — paragraphs first, then sentences, then words. This keeps content semantically coherent.

    from langchain_text_splitters import RecursiveCharacterTextSplitter
    
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,
        separators=[
            "\n\n",    # First try: split on double newlines (paragraphs)
            "\n",      # Then: single newlines
            ". ",      # Then: sentences
            ", ",      # Then: clauses
            " ",       # Then: words
            ""         # Finally: characters
        ]
    )
    
    chunks = splitter.split_text(document_text)

    The splitter tries each separator in order. If splitting on paragraphs produces chunks within the size limit, it stops there. If a paragraph is too large, it falls back to splitting on sentences, and so on.
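    That fallback logic can be sketched in a few lines of plain Python. This is a simplified illustration, not the actual LangChain implementation (which, among other things, merges small adjacent pieces back together up to the size limit and keeps separators):

```python
def recursive_split(text, separators, chunk_size):
    """Try the coarsest separator first; recurse into any piece
    that is still larger than chunk_size."""
    if len(text) <= chunk_size:
        return [text]
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut at chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            if piece:
                chunks.append(piece)
        else:
            # Piece too large: fall back to the next, finer separator
            chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks

doc = ("First paragraph.\n\n"
       "A much longer second paragraph that needs further splitting. "
       "It has several sentences. Each one is short.")
chunks = recursive_split(doc, ["\n\n", ". ", " ", ""], chunk_size=60)
```

    Here the short first paragraph survives intact, while the oversized second paragraph falls through to sentence-level splitting.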

    When to use: Most use cases. This is the default recommendation for general text documents.

    Strategy 3: Semantic Chunking

    Instead of splitting at a fixed size, split based on meaning: when the topic changes, start a new chunk.

    from langchain_experimental.text_splitter import SemanticChunker
    from langchain_openai import OpenAIEmbeddings
    
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    
    semantic_splitter = SemanticChunker(
        embeddings,
        breakpoint_threshold_type="percentile",
        breakpoint_threshold_amount=75
    )
    
    chunks = semantic_splitter.split_text(document_text)

    How it works:

    1. Split the text into sentences
    2. Embed each sentence
    3. Compare consecutive sentence embeddings
    4. When similarity drops below a threshold, start a new chunk
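    The four steps above can be sketched directly. This is a toy illustration: the `embed` function here is a stand-in word-presence vector, not a real embedding model or any library API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def semantic_chunks(sentences, embed, threshold=0.5):
    """Group consecutive sentences; start a new chunk when similarity
    between neighboring sentence embeddings drops below the threshold."""
    chunks = [[sentences[0]]]
    prev_vec = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine(prev_vec, vec) < threshold:
            chunks.append([sent])      # topic shift: new chunk
        else:
            chunks[-1].append(sent)    # same topic: extend current chunk
        prev_vec = vec
    return [" ".join(c) for c in chunks]

# Toy "embedding": presence/absence of words from a tiny vocabulary.
# A real pipeline would call an embedding model here instead.
VOCAB = ["vacation", "days", "unused", "office", "printer", "paper"]

def embed(sentence):
    words = sentence.lower().replace(".", "").split()
    return [1.0 if w in words else 0.0 for w in VOCAB]

sentences = [
    "Employees get 15 vacation days.",
    "Unused vacation days roll over.",
    "The office printer uses A4 paper.",
]
print(semantic_chunks(sentences, embed, threshold=0.1))
```

    The two vacation sentences share vocabulary and stay in one chunk; the printer sentence has no overlap with them, so the similarity drops and a new chunk begins.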

    When to use: Documents with multiple topics, transcripts, long-form articles where topics shift naturally.

    Drawbacks: Slower (requires embedding every sentence), chunk sizes are unpredictable.

    Strategy 4: Document-Structure-Based Chunking

    Use the document's own structure — headers, sections, pages — as chunk boundaries.

    import re
    
    def chunk_by_headers(markdown_text: str) -> list[dict]:
        """Split a Markdown document by headers."""
        # (?m) lets ^ and $ match at every line, so a header at the very
        # start of the document is caught too
        sections = re.split(r"(?m)^(#{1,3} .+)$", markdown_text)
    
        chunks = []
        current_header = "Introduction"
    
        for section in sections:
            if re.match(r"#{1,3} .+", section):
                current_header = section.strip("# ").strip()
            elif section.strip():
                chunks.append({
                    "content": section.strip(),
                    "header": current_header
                })
    
        return chunks
    
    # For HTML documents
    from langchain_text_splitters import HTMLSectionSplitter
    
    html_splitter = HTMLSectionSplitter(
        headers_to_split_on=[
            ("h1", "Header 1"),
            ("h2", "Header 2"),
            ("h3", "Header 3"),
        ]
    )
    chunks = html_splitter.split_text(html_content)

    When to use: Structured documents like docs sites, wikis, markdown files, HTML pages.

    Overlap: Why It Matters

    Overlap means the end of one chunk appears at the start of the next chunk. This prevents information loss at chunk boundaries.

    Without overlap (information lost at boundary):
      Chunk 1: "...employees receive 15 days of paid"
      Chunk 2: "vacation per year. Unused days can be..."
    
    With 50-char overlap (boundary information preserved):
      Chunk 1: "...employees receive 15 days of paid vacation per year."
      Chunk 2: "15 days of paid vacation per year. Unused days can be..."
    

    A good rule of thumb: 10-20% overlap relative to chunk size.

    Chunk Size   | Recommended Overlap
    200 chars    | 20-40 chars
    500 chars    | 50-100 chars
    1000 chars   | 100-200 chars

    Chunk Size Considerations

    The right chunk size depends on your use case:

    # Experimentation template — try different sizes and measure retrieval quality
    chunk_sizes = [200, 500, 800, 1200]
    
    for size in chunk_sizes:
        splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=size // 10)
        chunks = splitter.split_text(document_text)
        print(f"Chunk size {size}: {len(chunks)} chunks, avg {sum(len(c) for c in chunks) // len(chunks)} chars")

    Guidelines:

    Document Type        | Recommended Chunk Size | Why
    FAQ / Short answers  | 200-300 chars          | Each Q&A is self-contained
    Technical docs       | 500-800 chars          | Need enough context for code/explanations
    Legal documents      | 800-1200 chars         | Clauses reference surrounding context
    Chat logs            | 300-500 chars          | Conversations are relatively short
    Books / articles     | 500-1000 chars         | Paragraphs are natural units

    Best Practices

    1. Always include metadata — Store the source file, page number, section header, and chunk index with each chunk
    2. Preserve context — Add the section header or document title to each chunk so it makes sense in isolation
    3. Test with real queries — The best chunk size is the one that retrieves the right information for your actual questions
    4. Use token-based splitting for LLMs — Characters and tokens are not the same. "Hello world" is 2 tokens but 11 characters. Use tiktoken for accurate token counting
    5. Consider your embedding model's limit — Most models handle 512 tokens well; some support up to 8192
    import tiktoken
    
    encoder = tiktoken.encoding_for_model("text-embedding-3-small")
    
    def count_tokens(text: str) -> int:
        return len(encoder.encode(text))
    
    # Ensure chunks are within token limits
    for chunk in chunks:
        tokens = count_tokens(chunk)
        if tokens > 512:
            print(f"Warning: chunk has {tokens} tokens (recommended max: 512)")
    6. Enrich chunks with context — Prepend parent information so each chunk stands alone:
    enriched_chunks = []
    for chunk in chunks:
        enriched = f"Document: {doc_title}\nSection: {chunk['header']}\n\n{chunk['content']}"
        enriched_chunks.append(enriched)

    What to ask your AI: "I have [document type] documents averaging [size]. Recommend the best chunking strategy, chunk size, and overlap for my RAG pipeline."

    What's Next?

    Your documents are chunked and indexed. But how do you know if your RAG system is actually returning good results? Next, we cover evaluating RAG quality — metrics, testing strategies, and common failure modes.

