
How you split your documents determines whether RAG finds what you need or returns noise. Here's the complete breakdown with code.
In our article on advanced RAG optimization, we covered chunking as one of three techniques that take retrieval from functional to genuinely good. The advice was clear: start with recursive chunking, measure your metrics, optimize when data shows you need to.
But that raises a question: what are you actually optimizing toward?
There are six major chunking strategies, each with different tradeoffs. Recursive chunking is the default for good reason — Firecrawl's benchmarks show 85-90% recall at 400-512 tokens. But it's not always the best choice. Page-level chunking won NVIDIA's 2024 benchmarks with 0.648 accuracy. Semantic chunking can improve accuracy by up to 70% according to LangCopilot's testing. Sentence-based splitting outperforms semantic approaches with certain embedding models.
This deep dive covers all six strategies with benchmark data, code examples, and a decision framework for choosing between them.
Chunking determines what "a result" even means in your RAG system.
A chunk that splits a key sentence in half can't be found — it doesn't contain a complete thought. A chunk that's too large matches everything vaguely and nothing specifically. The difference between strategies can mean up to 9% variation in recall, according to Firecrawl's testing.
Here's what makes this worse: the popular OpenAI Assistants default of 800 tokens with 400 overlap scored "below-average recall and lowest scores across all other metrics" in Chroma Research's evaluation. The default is actively hurting you.
Query type matters too. Factoid queries ("What's the deadline?") perform best at 256-512 tokens. Analytical queries ("What are the key themes?") need 1024+ tokens. One size does not fit all.
The good news: unlike embedding models or LLMs, chunking is endlessly tunable. You control it completely. That's where the leverage lives.
The blunt instrument.
Fixed-size chunking splits every N tokens or characters regardless of content. No awareness of sentences, paragraphs, or meaning — just a counter that triggers a split.
def fixed_size_chunk(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size windows with a small overlap."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap  # slide the window back by the overlap
    return chunks
Why you'd use it: It's fast. Predictable chunk sizes simplify batch processing. Zero computational overhead.
Why you shouldn't: It cuts through sentences mid-thought. "The refund policy allows returns within" in one chunk, "30 days of purchase" in another. Neither chunk is useful.
Benchmark reality: Lower accuracy than structure-aware methods across all major benchmarks. There's no scenario where fixed-size beats recursive except speed of implementation.
Verdict: Use for prototyping to get something running quickly. Replace before production.
The pragmatic default.
Recursive splitting works hierarchically. It tries to split on paragraph breaks first. If chunks are still too large, it splits on newlines. Then sentences. Then words. Then characters. The result: chunks that respect document structure naturally.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_text(document)
Why it works: The separator hierarchy matches how humans structure documents. Paragraphs contain related ideas. Sentences contain complete thoughts. The splitter respects these boundaries when it can.
Benchmark reality: 85-90% recall at 400-512 tokens with 10-20% overlap according to Firecrawl. Chroma Research measured 88.1% recall at 200 tokens. Consistently strong across document types.
When to use it: Start here for 80% of RAG applications. Articles, documentation, research papers, support content — recursive handles them all.
Customization tip: For code files, adjust the separators to respect language structure:
code_separators = ["\nclass ", "\ndef ", "\n\n", "\n", " ", ""]
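One way to apply them is to pass the list into the same splitter; LangChain also ships per-language presets via from_language, shown here as an alternative. Treat this as a sketch rather than the only way to wire it up:
code_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=code_separators
)

# Alternatively, let LangChain supply language-aware separators:
from langchain.text_splitter import Language
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=512, chunk_overlap=50
)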
Verdict: The LangChain default for good reason. Start here, measure, and only optimize when data shows you need to.
The NLP approach.
Sentence-based chunking uses natural language processing to detect sentence boundaries, then groups complete sentences into chunks. No sentence ever gets split in half.
from llama_index.core.node_parser import SentenceSplitter

parser = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20
)
nodes = parser.get_nodes_from_documents(documents)
Why it matters: Users always see complete thoughts. "Returns must be initiated within 30 days" stays intact. The semantic unit matches how humans read.
The surprise finding: In Firecrawl's testing, sentence-based splitting outperformed semantic chunking when using ColBERT v2 embeddings. The simpler approach won.
The tradeoff: Sentence length varies wildly. One chunk might be 50 tokens, the next 500. This variability can complicate batch processing and make retrieval scoring less predictable.
When to use it: Conversational data, Q&A content, transcripts, short-form material where sentence boundaries are the natural semantic units.
Verdict: Strong choice when you need guaranteed complete thoughts. Particularly effective for dialogue and FAQ content.
The sleeper pick.
Page-level chunking treats each PDF page as a separate chunk. One page, one chunk. No splitting within pages.
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="document.pdf",
    strategy="hi_res",
    multipage_sections=False  # Keep pages separate
)
The benchmark that surprised everyone: NVIDIA's 2024 evaluation tested seven chunking strategies across five datasets. Page-level chunking won with 0.648 accuracy — and crucially, the lowest variance at 0.107. It performed most consistently across document types.
Why it wins: Pages are designed by humans to contain related information. A financial report's page 12 has the Q3 earnings. A research paper's page 7 has the methodology. The author already did your semantic grouping.
Tables and figures stay intact. A table that spans 40 rows doesn't get split into meaningless fragments. Charts keep their captions. Visual layout is preserved.
The limitation: Only works for paginated documents. Markdown files, HTML, raw text — these don't have pages.
When to use it: PDFs with visual layouts, financial reports, research papers with tables and figures, legal documents with page citations, scanned documents.
Verdict: If you're working with PDFs and not using page-level chunking, you're probably leaving accuracy on the table.
The premium option.
Semantic chunking analyzes embedding similarity between consecutive sentences. Where the topic shifts — where similarity drops — it splits. Each chunk becomes thematically coherent.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

chunker = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)
chunks = chunker.split_text(document)
How it works under the hood:
1. Split the document into sentences.
2. Embed each sentence.
3. Measure the similarity between consecutive sentences.
4. Split wherever the drop in similarity exceeds the breakpoint threshold (here, the 95th percentile of the distances).
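A minimal sketch of that logic, assuming a hypothetical embed() helper that maps a sentence to a vector (this is the idea, not the library's internals):
import numpy as np

def semantic_split(sentences: list[str], embed, percentile: float = 95) -> list[str]:
    if len(sentences) < 2:
        return [" ".join(sentences)]

    # Embed each sentence and measure cosine distance between neighbours.
    vectors = np.array([embed(s) for s in sentences])
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    distances = 1 - np.sum(vectors[:-1] * vectors[1:], axis=1)

    # Split wherever the distance exceeds the chosen percentile threshold.
    threshold = np.percentile(distances, percentile)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks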
The benchmark case: LangCopilot's testing showed up to 70% accuracy improvement over naive approaches. Chroma Research measured the ClusterSemanticChunker variant at 0.919 recall — the highest in their evaluation. It also achieved 8.0% IoU compared to 6.9% for recursive at the same token count.
The tradeoff is real: You're embedding every sentence before you even start retrieval. For a 10,000-word document, that's hundreds of embedding calls. API costs add up. Processing time multiplies.
When to use it: Accuracy is paramount and budget allows. Domain-specific content where topic boundaries matter. High-value documents where the compute cost is justified.
The honest question: Does a 3% improvement in recall justify 10x the processing time and cost? Sometimes yes. Usually no. Test before committing.
Verdict: The premium option that delivers when you need it. Not the default.
The frontier.
LLM-based chunking sends document sections to a language model and asks it to identify optimal split points. The LLM understands context, structure, and meaning in ways that rule-based approaches can't.
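There is no single canonical implementation. A rough sketch of the idea using the OpenAI chat API, where the sentinel-marker prompt and model choice are our own illustration, not a standard:
from openai import OpenAI

client = OpenAI()

def llm_split(section: str) -> list[str]:
    # Ask the model to mark topic boundaries with a sentinel string.
    prompt = (
        "Insert the marker <<<SPLIT>>> between passages that cover different "
        "topics. Return the full text otherwise unchanged.\n\n" + section
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    return [part.strip() for part in text.split("<<<SPLIT>>>") if part.strip()]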
The benchmark promise: in Chroma Research's evaluation, the LLM-driven chunker landed in the same range as the best semantic variants, roughly 0.91 recall. The quality is there.
The cost reality: You're paying for LLM inference on every document you ingest. For a corpus of 10,000 documents, that's 10,000 API calls before you've answered a single query. At current pricing, that adds up fast.
Where it makes sense: Small, high-value corpora ingested once. Complex or irregular documents that rule-based splitters handle poorly. Pipelines where ingestion cost is a one-time expense.
Where it doesn't: Production systems at scale. Frequently updated corpora. Budget-constrained projects.
Verdict: Watch this space. The quality is promising. The economics aren't there yet for most use cases.
Here's how to choose:
| Your situation | Start with | Why |
|---|---|---|
| Just getting started | Recursive (400-512 tokens) | Best balance, fastest iteration |
| PDFs with tables/figures | Page-level | Won NVIDIA benchmarks, preserves layout |
| Q&A or conversational content | Sentence-based | Maintains complete thoughts |
| Retrieval metrics disappointing | Semantic | Up to 70% accuracy improvement |
| Code repositories | Recursive with language separators | Respects code structure |
| Tight budget, speed critical | Fixed-size | Fast but lower quality |
| High-value, accuracy critical | Semantic or LLM-based | Maximum quality |
Query-type adjustments: factoid queries ("What's the deadline?") do best at 256-512 tokens; analytical queries ("What are the key themes?") need 1024+ tokens.
Chunk size: align with your embedding model's optimal input — typically 256, 512, or 1024 tokens. Chroma Research found that reducing chunks from 400 to 200 tokens doubled precision scores. Smaller chunks mean more specific matches; larger chunks mean more context per result.
Overlap: the standard rule is 10-20% of chunk size. Overlap repeats the text around each split point in both neighbouring chunks, so a sentence cut at a boundary still appears whole in one of them. "For more details, see" in one chunk connects to "the following requirements" in the next.
The tradeoff: Chroma found that overlap improves recall but degrades IoU — you're penalized for redundancy. More overlap means more storage and potential duplicate retrieval.
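To express these limits in tokens rather than characters, one option is LangChain's tiktoken-based constructor. A sketch, assuming the cl100k_base encoding roughly matches your embedding model's tokenizer:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=512,    # tokens, not characters
    chunk_overlap=64   # ~12% overlap, inside the 10-20% rule
)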
Include section headers in each chunk. "Returns must be initiated within 30 days" is useless without knowing it's from the "Refund Policy" section.
Source tracking enables verification. Users can check your work. That builds trust.
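A sketch of what that looks like in practice; the field names here are illustrative, not a required schema:
chunk_text = "Returns must be initiated within 30 days."

chunk_record = {
    # Prepend the section path so the text carries its own context.
    "text": "Refund Policy > Returns\n" + chunk_text,
    # Source metadata lets answers cite the exact document and page.
    "metadata": {"source": "policies.pdf", "page": 12, "section": "Refund Policy"},
}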
You won't know if chunking is working without measuring.
The 5-minute test: take a handful of questions your users actually ask (or generate synthetic ones), run them against your index, and check whether the chunk that contains the answer shows up in the top five results.
What to measure: recall (does the correct chunk appear in the top results?) and precision (how much of what comes back is actually relevant?).
Diagnosis: Low recall means chunks are too large or semantically incoherent. Low precision means chunks are too small or too similar to each other.
Chroma Research generated synthetic queries via GPT-4 at roughly $0.01 per question. Cheap enough to build a proper test set.
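A minimal recall@k check, assuming you have a list of (question, source_chunk_id) pairs and a retrieve() function of your own that returns ranked chunk IDs:
def recall_at_k(test_set, retrieve, k: int = 5) -> float:
    # Fraction of questions whose source chunk appears in the top-k results.
    hits = sum(
        1 for question, chunk_id in test_set
        if chunk_id in retrieve(question, k=k)
    )
    return hits / len(test_set)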
Six strategies. Different tradeoffs.
Recursive chunking is the pragmatic default — 85-90% recall, works across document types, no compute overhead. Start here.
Page-level chunking is the sleeper pick for PDFs — won NVIDIA's benchmarks, preserves tables and figures, often overlooked.
Semantic chunking is the premium option — up to 70% improvement when you need it, but the compute cost is real.
The 9% recall gap between strategies matters. But so does shipping. Get chunking good enough that it stops being your bottleneck, then focus on retrieval and generation.
Each chunk should be able to answer a question on its own. If it can't, it's the wrong chunk.
This is a deep dive in our RAG series. For the overview of chunking alongside hybrid search and reranking, see Beyond Basic RAG: Chunking, Hybrid Search, and Reranking.
I lead data & AI for New Zealand's largest insurer. Before that, 10+ years building enterprise software. I write about AI for people who need to finish things, not just play with tools.
