# Chunking Strategies
How you split your documents into chunks is one of the most important decisions in a RAG pipeline. Bad chunking leads to bad retrieval, and bad retrieval leads to bad answers. Let's explore the strategies and learn when to use each one.
## Why Chunking Matters
Consider a 50-page company handbook. You cannot embed the entire thing as one vector — it would be too large and too vague. You also cannot embed each word individually — that loses all context. Chunking is about finding the right granularity.
The impact of chunk size:
| Chunk Size | Pros | Cons |
|---|---|---|
| Too small (50-100 chars) | Very precise matching | Loses context, fragments sentences |
| Too large (5000+ chars) | Full context preserved | Vague matching, may exceed model limits |
| Just right (200-1000 chars) | Good balance of precision and context | Requires tuning per document type |
## Strategy 1: Fixed-Size Chunking
The simplest approach — split text every N characters (or tokens) with optional overlap.
```python
def fixed_size_chunk(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks with overlap."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        start += chunk_size - overlap
    return chunks

# Example
chunks = fixed_size_chunk(document_text, chunk_size=500, overlap=50)
```
When to use: Quick prototyping, uniform text (like chat logs), when you need simplicity.
Drawbacks: Cuts sentences and paragraphs in the middle. A sentence about vacation policy might end up split across two chunks, making neither chunk useful on its own.
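This failure mode is easy to see with a minimal sketch (a simplified splitter with no overlap; the policy sentence is illustrative):

```python
def fixed_size_chunk(text, chunk_size=40, overlap=0):
    """Minimal fixed-size splitter for demonstration."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Illustrative handbook sentence (not from a real document)
policy = "Employees receive 15 days of paid vacation per year. Unused days roll over."
for i, c in enumerate(fixed_size_chunk(policy, chunk_size=40)):
    print(f"Chunk {i}: {c!r}")
```

The word "vacation" is cut in half at the 40-character boundary, so a query about vacation policy matches neither chunk well.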
## Strategy 2: Recursive Text Splitting
Splits on natural boundaries — paragraphs first, then sentences, then words. This keeps content semantically coherent.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=[
        "\n\n",  # First try: split on double newlines (paragraphs)
        "\n",    # Then: single newlines
        ". ",    # Then: sentences
        ", ",    # Then: clauses
        " ",     # Then: words
        "",      # Finally: characters
    ],
)
chunks = splitter.split_text(document_text)
```
The splitter tries each separator in order. If splitting on paragraphs produces chunks within the size limit, it stops there. If a paragraph is too large, it falls back to splitting on sentences, and so on.
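The fallback logic described above can be sketched in plain Python (a simplified, illustrative version; the real RecursiveCharacterTextSplitter also merges small pieces back together, keeps separators, and handles overlap):

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ", "")):
    """Simplified recursive splitter: try each separator in order,
    recursing into any piece that is still too large."""
    if len(text) <= chunk_size:
        return [text]
    sep, *rest = separators
    if sep == "":
        # Last resort: hard character split
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks = []
    for piece in (p for p in text.split(sep) if p):
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks

doc = "First paragraph.\n\nSecond paragraph is quite a bit longer. It has two sentences."
print(recursive_split(doc, chunk_size=40))
```

The short first paragraph survives intact after the paragraph-level split, while the long second paragraph falls through to the sentence-level split.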
When to use: Most use cases. This is the default recommendation for general text documents.
## Strategy 3: Semantic Chunking
Instead of splitting by size or characters, split based on meaning. When the topic changes, create a new chunk.
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
semantic_splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=75,
)
chunks = semantic_splitter.split_text(document_text)
```
How it works:
- Split the text into sentences
- Embed each sentence
- Compare consecutive sentence embeddings
- When similarity drops below a threshold, start a new chunk
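The steps above can be sketched end to end with a toy embedding function (bag-of-words counts stand in for a real embedding model, and the 0.2 threshold and sample sentences are illustrative assumptions):

```python
import math
from collections import Counter

def embed(sentence):
    """Toy stand-in for a real embedding model: bag-of-words counts.
    In practice you would call an embedding API here."""
    return Counter(sentence.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, threshold=0.2):
    """Start a new chunk whenever similarity between
    consecutive sentences drops below the threshold."""
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sent in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks

sentences = [
    "Our vacation policy grants 15 vacation days.",
    "Unused vacation days roll over each year.",
    "The office kitchen is cleaned on Fridays.",
]
print(semantic_chunks(sentences))
```

The two vacation sentences share vocabulary and stay together; the unrelated kitchen sentence starts a new chunk.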
When to use: Documents with multiple topics, transcripts, long-form articles where topics shift naturally.
Drawbacks: Slower (requires embedding every sentence), chunk sizes are unpredictable.
## Strategy 4: Document-Structure-Based Chunking
Use the document's own structure — headers, sections, pages — as chunk boundaries.
```python
import re

def chunk_by_headers(markdown_text: str) -> list[dict]:
    """Split a Markdown document by headers."""
    sections = re.split(r"\n(#{1,3} .+)\n", markdown_text)
    chunks = []
    current_header = "Introduction"
    for section in sections:
        if re.match(r"#{1,3} .+", section):
            current_header = section.strip("# ").strip()
        elif section.strip():
            chunks.append({
                "content": section.strip(),
                "header": current_header,
            })
    return chunks

# For HTML documents
from langchain.text_splitter import HTMLSectionSplitter

html_splitter = HTMLSectionSplitter(
    headers_to_split_on=[
        ("h1", "Header 1"),
        ("h2", "Header 2"),
        ("h3", "Header 3"),
    ]
)
chunks = html_splitter.split_text(html_content)
```
When to use: Structured documents like docs sites, wikis, markdown files, HTML pages.
## Overlap: Why It Matters
Overlap means the end of one chunk appears at the start of the next chunk. This prevents information loss at chunk boundaries.
Without overlap (information lost at boundary):

```
Chunk 1: "...employees receive 15 days of paid"
Chunk 2: "vacation per year. Unused days can be..."
```

With 50-char overlap (boundary information preserved):

```
Chunk 1: "...employees receive 15 days of paid vacation per year."
Chunk 2: "15 days of paid vacation per year. Unused days can be..."
```
A good rule of thumb: 10-20% overlap relative to chunk size.
| Chunk Size | Recommended Overlap |
|---|---|
| 200 chars | 20-40 chars |
| 500 chars | 50-100 chars |
| 1000 chars | 100-200 chars |
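A minimal sketch makes the effect concrete (the splitter and sample text are illustrative):

```python
def chunk_with_overlap(text, chunk_size, overlap):
    """Minimal fixed-size splitter with configurable overlap."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - overlap
    return chunks

# Illustrative handbook sentence (not from a real document)
text = "Employees receive 15 days of paid vacation per year. Unused days can be carried over."
no_overlap = chunk_with_overlap(text, chunk_size=50, overlap=0)
with_overlap = chunk_with_overlap(text, chunk_size=50, overlap=10)

# With overlap, the last 10 chars of each chunk reappear at the
# start of the next, so text at the boundary survives in both.
print(with_overlap[0][-10:] == with_overlap[1][:10])  # True
```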
## Chunk Size Considerations
The right chunk size depends on your use case:
```python
# Experimentation template — try different sizes and measure retrieval quality
from langchain.text_splitter import RecursiveCharacterTextSplitter

chunk_sizes = [200, 500, 800, 1200]
for size in chunk_sizes:
    splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=size // 10)
    chunks = splitter.split_text(document_text)
    print(f"Chunk size {size}: {len(chunks)} chunks, "
          f"avg {sum(len(c) for c in chunks) // len(chunks)} chars")
```
Guidelines:
| Document Type | Recommended Chunk Size | Why |
|---|---|---|
| FAQ / Short answers | 200-300 chars | Each Q&A is self-contained |
| Technical docs | 500-800 chars | Need enough context for code/explanations |
| Legal documents | 800-1200 chars | Clauses reference surrounding context |
| Chat logs | 300-500 chars | Conversations are relatively short |
| Books / articles | 500-1000 chars | Paragraphs are natural units |
## Best Practices
- Always include metadata — Store the source file, page number, section header, and chunk index with each chunk
- Preserve context — Add the section header or document title to each chunk so it makes sense in isolation
- Test with real queries — The best chunk size is the one that retrieves the right information for your actual questions
- Use token-based splitting for LLMs — Characters and tokens are not the same. "Hello world" is 2 tokens but 11 characters. Use `tiktoken` for accurate token counting
- Consider your embedding model's limit — Most models handle 512 tokens well; some support up to 8192
```python
import tiktoken

encoder = tiktoken.encoding_for_model("text-embedding-3-small")

def count_tokens(text: str) -> int:
    return len(encoder.encode(text))

# Ensure chunks are within token limits
for chunk in chunks:
    tokens = count_tokens(chunk)
    if tokens > 512:
        print(f"Warning: chunk has {tokens} tokens (recommended max: 512)")
```
- Enrich chunks with context — Prepend parent information so each chunk stands alone:
```python
enriched_chunks = []
for chunk in chunks:
    enriched = f"Document: {doc_title}\nSection: {chunk['header']}\n\n{chunk['content']}"
    enriched_chunks.append(enriched)
```
What to ask your AI: "I have [document type] documents averaging [size]. Recommend the best chunking strategy, chunk size, and overlap for my RAG pipeline."
## What's Next?
Your documents are chunked and indexed. But how do you know if your RAG system is actually returning good results? Next, we cover evaluating RAG quality — metrics, testing strategies, and common failure modes.