
How you split your documents determines whether RAG finds what you need or returns noise. Here's the complete breakdown with code.
In our article on advanced RAG optimization, we covered chunking as one of three techniques that take retrieval from functional to genuinely good. The advice was clear: start with recursive chunking, measure your metrics, optimize when data shows you need to.
But that raises a question: what are you actually optimizing toward?
There are six major chunking strategies, each with different tradeoffs. Recursive chunking is the default for good reason — Firecrawl's benchmarks show 85-90% recall at 400-512 tokens. But it's not always the best choice. Page-level chunking won NVIDIA's 2024 benchmarks with 0.648 accuracy. Semantic chunking can improve accuracy by up to 70% according to LangCopilot's testing. Sentence-based splitting outperforms semantic approaches with certain embedding models.
This deep dive covers all six strategies with benchmark data, code examples, and a decision framework for choosing between them.
Chunking determines what "a result" even means in your RAG system.
A chunk that splits a key sentence in half can't be found — it doesn't contain a complete thought. A chunk that's too large matches everything vaguely and nothing specifically. The difference between strategies can mean up to 9% variation in recall, according to Firecrawl's testing.
Here's what makes this worse: the popular OpenAI Assistants default of 800 tokens with 400 overlap scored "below-average recall and lowest scores across all other metrics" in Chroma Research's evaluation. The default is actively hurting you.
Query type matters too. Factoid queries ("What's the deadline?") perform best at 256-512 tokens. Analytical queries ("What are the key themes?") need 1024+ tokens. One size does not fit all.
The good news: unlike embedding models or LLMs, chunking is endlessly tunable. You control it completely. That's where the leverage lives.
The blunt instrument.
Fixed-size chunking splits every N tokens or characters regardless of content. No awareness of sentences, paragraphs, or meaning — just a counter that triggers a split.
def fixed_size_chunk(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size windows with a small overlap."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap  # slide the window back by the overlap
    return chunks
Why you'd use it: It's fast. Predictable chunk sizes simplify batch processing. Zero computational overhead.
Why you shouldn't: It cuts through sentences mid-thought. "The refund policy allows returns within" in one chunk, "30 days of purchase" in another. Neither chunk is useful.
Benchmark reality: Lower accuracy than structure-aware methods across all major benchmarks. There's no scenario where fixed-size beats recursive except speed of implementation.
Verdict: Use for prototyping to get something running quickly. Replace before production.
The pragmatic default.
Recursive splitting works hierarchically. It tries to split on paragraph breaks first. If chunks are still too large, it splits on newlines. Then sentences. Then words. Then characters. The result: chunks that respect document structure naturally.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_text(document)
Why it works: The separator hierarchy matches how humans structure documents. Paragraphs contain related ideas. Sentences contain complete thoughts. The splitter respects these boundaries when it can.
Benchmark reality: 85-90% recall at 400-512 tokens with 10-20% overlap according to Firecrawl. Chroma Research measured 88.1% recall at 200 tokens. Consistently strong across document types.
When to use it: Start here for 80% of RAG applications. Articles, documentation, research papers, support content — recursive handles them all.
Customization tip: For code files, adjust the separators to respect language structure:
code_separators = ["\nclass ", "\ndef ", "\n\n", "\n", " ", ""]
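One way to apply them is to pass the list into the same splitter; LangChain also ships per-language presets via from_language, shown here as an alternative. Treat this as a sketch rather than the only way to wire it up:
code_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=code_separators
)

# Alternatively, let LangChain supply language-aware separators:
from langchain.text_splitter import Language
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=512, chunk_overlap=50
)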
Verdict: The LangChain default for good reason. Start here, measure, and only optimize when data shows you need to.
The NLP approach.
Sentence-based chunking uses natural language processing to detect sentence boundaries, then groups complete sentences into chunks. No sentence ever gets split in half.
from llama_index.core.node_parser import SentenceSplitter

parser = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20
)
nodes = parser.get_nodes_from_documents(documents)
Why it matters: Users always see complete thoughts. "Returns must be initiated within 30 days" stays intact. The semantic unit matches how humans read.
The surprise finding: In Firecrawl's testing, sentence-based splitting outperformed semantic chunking when using ColBERT v2 embeddings. The simpler approach won.
The tradeoff: Sentence length varies wildly. One chunk might be 50 tokens, the next 500. This variability can complicate batch processing and make retrieval scoring less predictable.
When to use it: Conversational data, Q&A content, transcripts, short-form material where sentence boundaries are the natural semantic units.
Verdict: Strong choice when you need guaranteed complete thoughts. Particularly effective for dialogue and FAQ content.
The sleeper pick.
Page-level chunking treats each PDF page as a separate chunk. One page, one chunk. No splitting within pages.
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="document.pdf",
    strategy="hi_res",
    multipage_sections=False  # Keep pages separate
)
The benchmark that surprised everyone: NVIDIA's 2024 evaluation tested seven chunking strategies across five datasets. Page-level chunking won with 0.648 accuracy — and crucially, the lowest variance at 0.107. It performed most consistently across document types.
Why it wins: Pages are designed by humans to contain related information. A financial report's page 12 has the Q3 earnings. A research paper's page 7 has the methodology. The author already did your semantic grouping.
Tables and figures stay intact. A table that spans 40 rows doesn't get split into meaningless fragments. Charts keep their captions. Visual layout is preserved.
The limitation: Only works for paginated documents. Markdown files, HTML, raw text — these don't have pages.
When to use it: PDFs with visual layouts, financial reports, research papers with tables and figures, legal documents with page citations, scanned documents.
Verdict: If you're working with PDFs and not using page-level chunking, you're probably leaving accuracy on the table.
The premium option.
Semantic chunking analyzes embedding similarity between consecutive sentences. Where the topic shifts — where similarity drops — it splits. Each chunk becomes thematically coherent.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

chunker = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)
chunks = chunker.split_text(document)
How it works under the hood:
1. Split the document into sentences.
2. Embed each sentence.
3. Measure the similarity between consecutive sentences.
4. Split wherever the drop in similarity exceeds the breakpoint threshold (here, the 95th percentile of the distances).
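A minimal sketch of that logic, assuming a hypothetical embed() helper that maps a sentence to a vector (this is the idea, not the library's internals):
import numpy as np

def semantic_split(sentences: list[str], embed, percentile: float = 95) -> list[str]:
    if len(sentences) < 2:
        return [" ".join(sentences)]

    # Embed each sentence and measure cosine distance between neighbours.
    vectors = np.array([embed(s) for s in sentences])
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    distances = 1 - np.sum(vectors[:-1] * vectors[1:], axis=1)

    # Split wherever the distance exceeds the chosen percentile threshold.
    threshold = np.percentile(distances, percentile)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks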
The benchmark case: LangCopilot's testing showed up to 70% accuracy improvement over naive approaches. Chroma Research measured the ClusterSemanticChunker variant at 0.919 recall — the highest in their evaluation. It also achieved 8.0% IoU compared to 6.9% for recursive at the same token count.
The tradeoff is real: You're embedding every sentence before you even start retrieval. For a 10,000-word document, that's hundreds of embedding calls. API costs add up. Processing time multiplies.
When to use it: Accuracy is paramount and budget allows. Domain-specific content where topic boundaries matter. High-value documents where the compute cost is justified.
The honest question: Does a 3% improvement in recall justify 10x the processing time and cost? Sometimes yes. Usually no. Test before committing.
Verdict: The premium option that delivers when you need it. Not the default.
The frontier.
LLM-based chunking sends document sections to a language model and asks it to identify optimal split points. The LLM understands context, structure, and meaning in ways that rule-based approaches can't.
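There is no single canonical implementation. A rough sketch of the idea using the OpenAI chat API, where the sentinel-marker prompt and model choice are our own illustration, not a standard:
from openai import OpenAI

client = OpenAI()

def llm_split(section: str) -> list[str]:
    # Ask the model to mark topic boundaries with a sentinel string.
    prompt = (
        "Insert the marker <<<SPLIT>>> between passages that cover different "
        "topics. Return the full text otherwise unchanged.\n\n" + section
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    return [part.strip() for part in text.split("<<<SPLIT>>>") if part.strip()]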
The benchmark promise: in Chroma Research's evaluation, the LLM-driven chunker landed in the same range as the best semantic variants, roughly 0.91 recall. The quality is there.
The cost reality: You're paying for LLM inference on every document you ingest. For a corpus of 10,000 documents, that's 10,000 API calls before you've answered a single query. At current pricing, that adds up fast.
Where it makes sense: Small, high-value corpora ingested once. Complex or irregular documents that rule-based splitters handle poorly. Pipelines where ingestion cost is a one-time expense.
Where it doesn't: Production systems at scale. Frequently updated corpora. Budget-constrained projects.
Verdict: Watch this space. The quality is promising. The economics aren't there yet for most use cases.
Here's how to choose:
| Your situation | Start with | Why |
|---|---|---|
| Just getting started | Recursive (400-512 tokens) | Best balance, fastest iteration |
| PDFs with tables/figures | Page-level | Won NVIDIA benchmarks, preserves layout |
| Q&A or conversational content | Sentence-based | Maintains complete thoughts |
| Retrieval metrics disappointing | Semantic | Up to 70% accuracy improvement |
| Code repositories | Recursive with language separators | Respects code structure |
| Tight budget, speed critical | Fixed-size | Fast but lower quality |
| High-value, accuracy critical | Semantic or LLM-based | Maximum quality |
Query-type adjustments: factoid queries ("What's the deadline?") do best at 256-512 tokens; analytical queries ("What are the key themes?") need 1024+ tokens.
Chunk size: align with your embedding model's optimal input — typically 256, 512, or 1024 tokens. Chroma Research found that reducing chunks from 400 to 200 tokens doubled precision scores. Smaller chunks mean more specific matches; larger chunks mean more context per result.
Overlap: the standard rule is 10-20% of chunk size. Overlap repeats the text around each split point in both neighbouring chunks, so a sentence cut at a boundary still appears whole in one of them. "For more details, see" in one chunk connects to "the following requirements" in the next.
The tradeoff: Chroma found that overlap improves recall but degrades IoU — you're penalized for redundancy. More overlap means more storage and potential duplicate retrieval.
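To express these limits in tokens rather than characters, one option is LangChain's tiktoken-based constructor. A sketch, assuming the cl100k_base encoding roughly matches your embedding model's tokenizer:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=512,    # tokens, not characters
    chunk_overlap=64   # ~12% overlap, inside the 10-20% rule
)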
Include section headers in each chunk. "Returns must be initiated within 30 days" is useless without knowing it's from the "Refund Policy" section.
Source tracking enables verification. Users can check your work. That builds trust.
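A sketch of what that looks like in practice; the field names here are illustrative, not a required schema:
chunk_text = "Returns must be initiated within 30 days."

chunk_record = {
    # Prepend the section path so the text carries its own context.
    "text": "Refund Policy > Returns\n" + chunk_text,
    # Source metadata lets answers cite the exact document and page.
    "metadata": {"source": "policies.pdf", "page": 12, "section": "Refund Policy"},
}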
You won't know if chunking is working without measuring.
The 5-minute test: take a handful of questions your users actually ask (or generate synthetic ones), run them against your index, and check whether the chunk that contains the answer shows up in the top five results.
What to measure: recall (does the correct chunk appear in the top results?) and precision (how much of what comes back is actually relevant?).
Diagnosis: Low recall means chunks are too large or semantically incoherent. Low precision means chunks are too small or too similar to each other.
Chroma Research generated synthetic queries via GPT-4 at roughly $0.01 per question. Cheap enough to build a proper test set.
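A minimal recall@k check, assuming you have a list of (question, source_chunk_id) pairs and a retrieve() function of your own that returns ranked chunk IDs:
def recall_at_k(test_set, retrieve, k: int = 5) -> float:
    # Fraction of questions whose source chunk appears in the top-k results.
    hits = sum(
        1 for question, chunk_id in test_set
        if chunk_id in retrieve(question, k=k)
    )
    return hits / len(test_set)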
Six strategies. Different tradeoffs.
Recursive chunking is the pragmatic default — 85-90% recall, works across document types, no compute overhead. Start here.
Page-level chunking is the sleeper pick for PDFs — won NVIDIA's benchmarks, preserves tables and figures, often overlooked.
Semantic chunking is the premium option — up to 70% improvement when you need it, but the compute cost is real.
The 9% recall gap between strategies matters. But so does shipping. Get chunking good enough that it stops being your bottleneck, then focus on retrieval and generation.
Each chunk should be able to answer a question on its own. If it can't, it's the wrong chunk.
This is a deep dive in our RAG series. For the overview of chunking alongside hybrid search and reranking, see Beyond Basic RAG: Chunking, Hybrid Search, and Reranking.
I lead data & AI for New Zealand's largest insurer. Before that, 10+ years building enterprise software. I write about AI for people who need to finish things, not just play with tools.
