The Complete Guide to RAG Chunking: 6 Strategies with Code
Engineering · 12 min read · January 2, 2026

How you split your documents determines whether RAG finds what you need or returns noise. Here's the complete breakdown with code.

Rosh Jayawardena
Data & AI Executive

In our article on advanced RAG optimization, we covered chunking as one of three techniques that take retrieval from functional to genuinely good. The advice was clear: start with recursive chunking, measure your metrics, optimize when data shows you need to.

But that raises a question: what are you actually optimizing toward?

There are six major chunking strategies, each with different tradeoffs. Recursive chunking is the default for good reason — Firecrawl's benchmarks show 85-90% recall at 400-512 tokens. But it's not always the best choice. Page-level chunking won NVIDIA's 2024 benchmarks with 0.648 accuracy. Semantic chunking can improve accuracy by up to 70% according to LangCopilot's testing. Sentence-based splitting outperforms semantic approaches with certain embedding models.

This deep dive covers all six strategies with benchmark data, code examples, and a decision framework for choosing between them.

Why Chunking Is Your Biggest Lever

Chunking determines what "a result" even means in your RAG system.

A chunk that splits a key sentence in half can't be found — it doesn't contain a complete thought. A chunk that's too large matches everything vaguely and nothing specifically. The difference between strategies can mean up to 9% variation in recall, according to Firecrawl's testing.

Here's what makes this worse: the popular OpenAI Assistants default of 800 tokens with 400 overlap scored "below-average recall and lowest scores across all other metrics" in Chroma Research's evaluation. The default is actively hurting you.

Query type matters too. Factoid queries ("What's the deadline?") perform best at 256-512 tokens. Analytical queries ("What are the key themes?") need 1024+ tokens. One size does not fit all.

The good news: unlike embedding models or LLMs, chunking is endlessly tunable. You control it completely. That's where the leverage lives.

Strategy 1: Fixed-Size Chunking

The blunt instrument.

Fixed-size chunking splits every N tokens or characters regardless of content. No awareness of sentences, paragraphs, or meaning — just a counter that triggers a split.

Python
def fixed_size_chunk(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text every chunk_size characters, sharing `overlap` characters between chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        # Step back by the overlap so consecutive chunks share some context
        start = end - overlap
    return chunks

 

Why you'd use it: It's fast. Predictable chunk sizes simplify batch processing. Zero computational overhead.

Why you shouldn't: It cuts through sentences mid-thought. "The refund policy allows returns within" in one chunk, "30 days of purchase" in another. Neither chunk is useful.

Benchmark reality: Lower accuracy than structure-aware methods across all major benchmarks. There's no scenario where fixed-size beats recursive except speed of implementation.

Verdict: Use for prototyping to get something running quickly. Replace before production.

Strategy 2: Recursive Character Splitting

The pragmatic default.

Recursive splitting works hierarchically. It tries to split on paragraph breaks first. If chunks are still too large, it splits on newlines. Then sentences. Then words. Then characters. The result: chunks that respect document structure naturally.

Python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_text(document)

Why it works: The separator hierarchy matches how humans structure documents. Paragraphs contain related ideas. Sentences contain complete thoughts. The splitter respects these boundaries when it can.

Benchmark reality: 85-90% recall at 400-512 tokens with 10-20% overlap according to Firecrawl. Chroma Research measured 88.1% recall at 200 tokens. Consistently strong across document types.

When to use it: Start here for 80% of RAG applications. Articles, documentation, research papers, support content — recursive handles them all.

Customization tip: For code files, adjust the separators to respect language structure:

Python
code_separators = ["\nclass ", "\ndef ", "\n\n", "\n", " ", ""]

Verdict: The LangChain default for good reason. Start here, measure, and only optimize when data shows you need to.

Strategy 3: Sentence-Based Chunking

The NLP approach.

Sentence-based chunking uses natural language processing to detect sentence boundaries, then groups complete sentences into chunks. No sentence ever gets split in half.

Python
from llama_index.core.node_parser import SentenceSplitter

parser = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20
)
nodes = parser.get_nodes_from_documents(documents)

Why it matters: Users always see complete thoughts. "Returns must be initiated within 30 days" stays intact. The semantic unit matches how humans read.

The surprise finding: In Firecrawl's testing, sentence-based splitting outperformed semantic chunking when using ColBERT v2 embeddings. The simpler approach won.

The tradeoff: Sentence length varies wildly. One chunk might be 50 tokens, the next 500. This variability can complicate batch processing and make retrieval scoring less predictable.

When to use it: Conversational data, Q&A content, transcripts, short-form material where sentence boundaries are the natural semantic units.

Verdict: Strong choice when you need guaranteed complete thoughts. Particularly effective for dialogue and FAQ content.

Strategy 4: Page-Level Chunking

The sleeper pick.

Page-level chunking treats each PDF page as a separate chunk. One page, one chunk. No splitting within pages.

Python
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="document.pdf",
    strategy="hi_res",
    multipage_sections=False  # Keep pages separate
)

The benchmark that surprised everyone: NVIDIA's 2024 evaluation tested seven chunking strategies across five datasets. Page-level chunking won with 0.648 accuracy — and crucially, the lowest variance at 0.107. It performed most consistently across document types.

Why it wins: Pages are designed by humans to contain related information. A financial report's page 12 has the Q3 earnings. A research paper's page 7 has the methodology. The author already did your semantic grouping.

Tables and figures stay intact. A table that spans 40 rows doesn't get split into meaningless fragments. Charts keep their captions. Visual layout is preserved.

The limitation: Only works for paginated documents. Markdown files, HTML, raw text — these don't have pages.

When to use it: PDFs with visual layouts, financial reports, research papers with tables and figures, legal documents with page citations, scanned documents.

Verdict: If you're working with PDFs and not using page-level chunking, you're probably leaving accuracy on the table.

Strategy 5: Semantic Chunking

The premium option.

Semantic chunking analyzes embedding similarity between consecutive sentences. Where the topic shifts — where similarity drops — it splits. Each chunk becomes thematically coherent.

Python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

chunker = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)
chunks = chunker.split_text(document)

How it works under the hood:

  1. Embed every sentence in your document
  2. Calculate cosine similarity between adjacent sentences
  3. Find where similarity drops below your threshold (typically 95th percentile)
  4. Split at those breakpoints
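
To make those steps concrete, here's a minimal sketch of the same loop. The embed argument is a hypothetical helper that returns one vector per sentence; in practice you'd pass something like OpenAIEmbeddings().embed_documents. It assumes the document has at least two sentences.

Python
import numpy as np

def semantic_split(sentences: list[str], embed, percentile: float = 95) -> list[str]:
    """Split wherever similarity between adjacent sentences drops sharply."""
    vectors = np.array(embed(sentences))            # one embedding per sentence
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = np.sum(unit[:-1] * unit[1:], axis=1)     # cosine similarity of adjacent pairs
    distances = 1 - sims
    threshold = np.percentile(distances, percentile)

    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:                    # topic shift: start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks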

The benchmark case: LangCopilot's testing showed up to 70% accuracy improvement over naive approaches. Chroma Research measured the ClusterSemanticChunker variant at 0.919 recall — the highest in their evaluation. It also achieved 8.0% IoU compared to 6.9% for recursive at the same token count.

The tradeoff is real: You're embedding every sentence before you even start retrieval. For a 10,000-word document, that's hundreds of embedding calls. API costs add up. Processing time multiplies.

When to use it: Accuracy is paramount and budget allows. Domain-specific content where topic boundaries matter. High-value documents where the compute cost is justified.

The honest question: Does a 3% improvement in recall justify 10x the processing time and cost? Sometimes yes. Usually no. Test before committing.

Verdict: The premium option that delivers when you need it. Not the default.

Strategy 6: LLM-Based Chunking

The frontier.

LLM-based chunking sends document sections to a language model and asks it to identify optimal split points. The LLM understands context, structure, and meaning in ways that rule-based approaches can't.
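
There's no standard library call for this yet. A minimal sketch of the idea using the OpenAI Python SDK follows; the model name, prompt wording, and sentinel string are illustrative assumptions, not a fixed recipe.

Python
from openai import OpenAI

client = OpenAI()

def llm_chunk(section: str, model: str = "gpt-4o-mini") -> list[str]:
    """Ask an LLM to split a section at natural topic boundaries."""
    prompt = (
        "Split the following text into self-contained chunks at natural topic "
        "boundaries. Do not rewrite anything. Separate chunks with a line "
        "containing only '---CHUNK---'.\n\n" + section
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    text = response.choices[0].message.content
    return [chunk.strip() for chunk in text.split("---CHUNK---") if chunk.strip()]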

The benchmark promise: Chroma Research's LLMSemanticChunker variant reached 0.913 recall. The quality is there.

The cost reality: You're paying for LLM inference on every document you ingest. For a corpus of 10,000 documents, that's 10,000 API calls before you've answered a single query. At current pricing, that adds up fast.

Where it makes sense:

  • High-value content where accuracy justifies cost (legal contracts, medical records)
  • One-time ingestion of critical documents
  • Research and experimentation
  • Generating metadata simultaneously (summaries, topics, entities)

Where it doesn't: Production systems at scale. Frequently updated corpora. Budget-constrained projects.

Verdict: Watch this space. The quality is promising. The economics aren't there yet for most use cases.

 

The Decision Framework

Here's how to choose:

Your situation | Start with | Why
Just getting started | Recursive (400-512 tokens) | Best balance, fastest iteration
PDFs with tables/figures | Page-level | Won NVIDIA benchmarks, preserves layout
Q&A or conversational content | Sentence-based | Maintains complete thoughts
Retrieval metrics disappointing | Semantic | Up to 70% accuracy improvement
Code repositories | Recursive with language separators | Respects code structure
Tight budget, speed critical | Fixed-size | Fast but lower quality
High-value, accuracy critical | Semantic or LLM-based | Maximum quality

Query-type adjustments:

  • Factoid queries ("What's the deadline?"): 256-512 tokens
  • Analytical queries ("What are the key themes?"): 1024+ tokens
  • Mixed workload: 400-512 tokens (the balanced default)

The Parameters That Actually Matter

Chunk Size

Align with your embedding model's optimal input — typically 256, 512, or 1024 tokens. Chroma Research found that reducing from 400 to 200 tokens doubled precision scores. Smaller chunks mean more specific matches. Larger chunks mean more context per result.
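
A quick way to check that alignment is to count tokens directly. Here's a minimal sketch using tiktoken; the cl100k_base encoding is an assumption, so swap in whatever tokenizer matches your embedding model.

Python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def token_lengths(chunks: list[str]) -> list[int]:
    """Token count per chunk, to confirm chunks fit the target size."""
    return [len(encoding.encode(chunk)) for chunk in chunks]

lengths = token_lengths(chunks)
print(max(lengths), sum(lengths) / len(lengths))  # longest chunk and the average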

Overlap

The standard rule: 10-20% of chunk size. Overlap keeps context from being lost at chunk boundaries — a sentence cut at a boundary still appears whole in one of the two chunks. "For more details, see" in one chunk connects to "the following requirements" in the next.

The tradeoff: Chroma found that overlap improves recall but degrades IoU — you're penalized for redundancy. More overlap means more storage and potential duplicate retrieval.

Metadata — The Often-Missed Parameter

Include section headers in each chunk. "Returns must be initiated within 30 days" is useless without knowing it's from the "Refund Policy" section.

Source tracking enables verification. Users can check your work. That builds trust.
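
One lightweight way to carry that context is to attach it to each chunk as metadata. Here's a minimal sketch using LangChain's Document class; the field names and values are assumptions, so use whatever your vector store expects.

Python
from langchain_core.documents import Document

chunk = Document(
    page_content="Returns must be initiated within 30 days of purchase.",
    metadata={
        "section": "Refund Policy",       # header the chunk came from
        "source": "customer-terms.pdf",   # document, for verification
        "page": 4,
    },
)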

Testing Your Chunks

You won't know if chunking is working without measuring.

The 5-minute test:

  1. Pick 10 real questions users ask
  2. Identify which chunks should be retrieved
  3. Run queries and check recall@5

What to measure:

  • Recall@K: Are the right chunks in your results?
  • Precision@K: How much noise are you retrieving?
  • IoU (Intersection over Union): How well do chunk boundaries align with relevant content?
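
Recall@K and precision@K are simple to compute once you've labeled which chunks should come back for each question. A minimal sketch, assuming each chunk has a stable ID; the questions, labels, and search names in the usage example are placeholders for your own test harness.

Python
def recall_at_k(relevant: set[str], retrieved: list[str], k: int = 5) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    hits = len(relevant & set(retrieved[:k]))
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(relevant: set[str], retrieved: list[str], k: int = 5) -> float:
    """Fraction of the top-k results that are actually relevant."""
    return len(relevant & set(retrieved[:k])) / k

# Hypothetical harness: questions is your 10 real queries, labels[q] is the set
# of chunk IDs that should be retrieved, search(q) returns ranked chunk IDs.
scores = [recall_at_k(labels[q], search(q), k=5) for q in questions]
print(f"Average recall@5: {sum(scores) / len(scores):.2f}")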

Diagnosis: Low recall means chunks are too large or semantically incoherent. Low precision means chunks are too small or too similar to each other.

Chroma Research generated synthetic queries via GPT-4 at roughly $0.01 per question. Cheap enough to build a proper test set.

The Bottom Line

Six strategies. Different tradeoffs.

Recursive chunking is the pragmatic default — 85-90% recall, works across document types, no compute overhead. Start here.

Page-level chunking is the sleeper pick for PDFs — won NVIDIA's benchmarks, preserves tables and figures, often overlooked.

Semantic chunking is the premium option — up to 70% improvement when you need it, but the compute cost is real.

The 9% recall gap between strategies matters. But so does shipping. Get chunking right enough to stop being your bottleneck, then focus on retrieval and generation.

Each chunk should be able to answer a question on its own. If it can't, it's the wrong chunk.


This is a deep dive in our RAG series. For the overview of chunking alongside hybrid search and reranking, see Beyond Basic RAG: Chunking, Hybrid Search, and Reranking.

#RAG Basics Series • #AI Strategy • #Vector DB • #RAG • #Enterprise
Rosh Jayawardena

Data & AI Executive

I lead data & AI for New Zealand's largest insurer. Before that, 10+ years building enterprise software. I write about AI for people who need to finish things, not just play with tools.

View all posts →

