
Most RAG tutorials stop at "it works." This one shows you how to make it work well.
Your RAG system works. It retrieves documents. It generates answers. Users can ask questions and get responses grounded in your content.
But something's off.
The results are mediocre. Sometimes it returns irrelevant chunks. Sometimes it misses the exact document you know contains the answer. Sometimes the response is technically correct but pulls from the wrong section entirely.
You've built basic RAG. Now you need to make it actually good. That's a different skill set — and most tutorials stop right before this part.
This article covers three techniques that take RAG from functional to genuinely good: smarter chunking, hybrid search, and reranking. These aren't theoretical improvements. They're the optimizations that production systems actually use.
Basic RAG has predictable failure modes. Once you've seen them, you'll recognize them everywhere.
Chunk boundary failures. The answer exists in your documents, but it got split across two chunks. Neither chunk alone is relevant enough to retrieve. The user gets a partial answer or no answer at all.
Semantic-but-not-exact misses. User searches for "PostgreSQL 17 performance improvements." Vector search returns generic database performance content because it understands the semantics of "performance" but doesn't weight the exact version number. The specific document mentioning 17 never surfaces.
Mediocre ranking. The right chunk is retrieved — but it's number seven in the results, and your top_k is set to five. So close, yet completely useless.
These aren't edge cases. They're the norm once you move past demo queries. Production RAG requires addressing all three.
The first question: how do you decide where to split documents?
Your chunking strategy determines whether meaning stays intact or gets shattered across arbitrary boundaries. Get this wrong, and no amount of fancy retrieval will save you.
Fixed-size chunking splits every N tokens regardless of content. It's simple and predictable, but it cuts through sentences, paragraphs, and ideas without regard for meaning. Use it only for homogeneous, unstructured content where boundaries don't matter.
Recursive chunking works hierarchically — first by sections, then paragraphs, then sentences, then characters. It respects document structure and is the default in LangChain for good reason. Testing shows 85-90% recall at 400-512 tokens with 10-20% overlap. For most use cases, this is the pragmatic choice.
Semantic chunking groups sentences by embedding similarity. Each chunk becomes thematically coherent — a single topic, a complete thought. Testing shows up to 70% accuracy improvement over naive approaches. The tradeoff is computation: you need to embed every sentence and calculate similarity scores to find split points.
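If you want to try semantic chunking, here's a minimal sketch using LangChain's experimental SemanticChunker. It assumes an OpenAI embedding model and an OPENAI_API_KEY in your environment; the threshold setting is illustrative, not a recommendation.

```python
# Semantic-chunking sketch. Assumes: pip install langchain-experimental langchain-openai
# and OPENAI_API_KEY set in the environment.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",  # split where sentence-to-sentence similarity drops sharply
)

document_text = "..."  # your source document
chunks = splitter.create_documents([document_text])
```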
Here's how to decide:
| Your Situation | Recommended Strategy |
|---|---|
| Just starting out | Recursive, 400-512 tokens, 10-20% overlap |
| Retrieval metrics are disappointing | Try semantic chunking |
| Documents are highly structured (code, legal) | Language-specific or section-aware |
| Latency is critical | Stick with recursive |
| Accuracy is paramount | Semantic or page-level |
The honest approach: start with recursive chunking. Measure your retrieval metrics. Only optimize when data shows chunking is your bottleneck. The wrong strategy creates up to a 9% gap in recall — significant, but not worth premature optimization.
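For that starting point, here's a minimal recursive-chunking sketch with LangChain. The sizes follow the 400-512 token, 10-20% overlap guidance above; treat them as defaults to tune against your own retrieval metrics.

```python
# Recursive chunking sketch. Token-based sizing via tiktoken so chunk_size matches
# the 400-512 token guidance. Requires: pip install langchain-text-splitters tiktoken
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=450,    # tokens per chunk
    chunk_overlap=64,  # roughly 15% overlap so ideas aren't cut at chunk boundaries
)

document_text = "..."  # your source document
chunks = splitter.split_text(document_text)
```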
Pure vector search understands meaning beautifully but fails at exact matches.
Consider this scenario: a user searches "What changed in version 3.2.1?" Your documents contain "Version 3.2.1 introduced breaking changes to the API..." — exactly what they need.
Vector search returns generic API documentation. Why? Because "3.2.1" has no semantic meaning to an embedding model. It's just a string of numbers. The model understands "version" and "changes" but can't distinguish 3.2.1 from 3.1.9 or 4.0.0.
Hybrid search solves this by running two searches in parallel: vector search for semantic understanding and keyword search (typically BM25) for exact matches. The results get combined using a fusion algorithm.
How Reciprocal Rank Fusion works:
Each search produces a ranked list. For every document, RRF calculates a score: 1/(rank + k), where k is typically 60. Documents appearing in both lists get combined scores. The final ranking reflects both semantic relevance and keyword precision.
Why k=60? It's been experimentally shown to balance both retrievers without letting either dominate. Lower values favor higher-ranked documents more aggressively; higher values flatten the distribution.
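The whole algorithm fits in a few lines. Here's a minimal sketch; the document IDs and lists are illustrative.

```python
# Reciprocal Rank Fusion sketch. Inputs are two ranked lists of document IDs
# (best first); output is a single fused ranking. k=60 is the common default.
from collections import defaultdict

def rrf_fuse(vector_hits: list[str], keyword_hits: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranked in (vector_hits, keyword_hits):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (rank + k)  # score = 1 / (rank + k)
    return sorted(scores, key=scores.get, reverse=True)

# "doc_b" appears in both lists, so its scores add up and it rises to the top.
print(rrf_fuse(["doc_a", "doc_b", "doc_c"], ["doc_b", "doc_d"]))
```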
Implementation is straightforward. Weaviate has built-in hybrid search with RRF. Qdrant's Query API supports it natively. Azure AI Search offers native RRF. In LangChain, the EnsembleRetriever lets you weight keyword and semantic retrievers however you want.
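A hedged sketch of that last option. Import paths shift between LangChain versions, and this example assumes an OpenAI embedding model plus the rank_bm25 and faiss-cpu packages.

```python
# Hybrid retrieval sketch: BM25 for exact matches, FAISS for semantics,
# combined by LangChain's EnsembleRetriever.
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

docs = [
    "Version 3.2.1 introduced breaking changes to the API...",
    "General guidance on tuning API performance.",
]

bm25 = BM25Retriever.from_texts(docs)                               # keyword side
vector = FAISS.from_texts(docs, OpenAIEmbeddings()).as_retriever()  # semantic side

hybrid = EnsembleRetriever(retrievers=[bm25, vector], weights=[0.4, 0.6])
results = hybrid.invoke("What changed in version 3.2.1?")
```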
When to use hybrid search: whenever queries mix natural language with exact terms that embeddings gloss over, such as version numbers, product names, error codes, and other identifiers. Production systems using hybrid retrieval report 25% reduced token usage: more relevant chunks mean less irrelevant context sent to the LLM.
Embedding models compress entire documents into single vectors. That compression is what makes fast retrieval possible across millions of documents. But compression loses nuance.
Reranking asks a different question: what if, after fast retrieval, you did a slower but more accurate pass on just the top candidates?
The two-stage architecture:
Stage one is your retriever — fast, retrieves 50-200 candidates from potentially millions of documents. Stage two is your reranker — slower, more precise, reorders those candidates by true relevance.
The key difference is how each processes information. Bi-encoders (your embedding models) process the query and document separately, then compare the resulting vectors. Cross-encoders (rerankers) process the query and document together, seeing the pair as a unit. This lets them recognize relevance that requires understanding both simultaneously.
The numbers are significant. Research from Databricks and Pinecone shows reranking improves retrieval quality by 15-48%. Modern cross-encoders can rerank 50 documents in about 1.5 seconds. The optimal candidate set is 50-75 documents for most applications, up to 200 for comprehensive search scenarios.
Your options:
Cohere Rerank is the most popular hosted option. Their Rerank 4 model has a 32K context window (four times the previous version), supports 100+ languages, and comes in Fast and Pro tiers depending on your latency-accuracy tradeoff.
BGE-reranker is open-source and self-hostable. No API costs, full control, competitive accuracy. Good for teams with the infrastructure to run inference.
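Here's a minimal sketch of that second stage using sentence-transformers and a BGE cross-encoder. The model choice, candidate list, and top_n are illustrative.

```python
# Reranking sketch: a cross-encoder scores each (query, chunk) pair jointly,
# then we keep the best few. Requires: pip install sentence-transformers
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]

# Stage one (vector or hybrid search) returns 50-200 candidates; stage two reorders them.
candidates = [
    "Version 3.2.1 introduced breaking changes to the API...",
    "General notes on API performance tuning.",
]
print(rerank("What changed in version 3.2.1?", candidates, top_n=1))
```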
When to add reranking: when the right chunks are being retrieved but ranked too low to make your top_k, and when you can spend an extra second or two of latency to buy that precision.
These techniques form a progression, not a checklist. Don't start at the top.
Level 1: Basic RAG. Fixed or recursive chunking. Vector search only. Results go directly to the LLM. This is where everyone starts.
Level 2: Better Chunking. Switch to semantic or context-aware chunking. Tune your chunk size and overlap based on your specific documents. Expected improvement: 9-70% in recall, depending on how bad your baseline was.
Level 3: Hybrid Search. Add BM25 alongside vector search. Fuse with RRF. Expected improvement: better handling of exact-match queries, 25% reduction in tokens sent to the LLM.
Level 4: Reranking. Add a cross-encoder reranking stage. Expected improvement: 15-48% in overall retrieval quality.
Level 5: Production Pipeline. All three working together — semantic chunking, hybrid retrieval, reranking. This is what serious production systems look like.
The principle: climb one level at a time. Measure everything. Only add complexity when your metrics show you need it.
The question isn't whether these techniques work. They do. The question is whether your system needs them yet.
If your RAG returns good results most of the time and users are satisfied, you might not need Level 4 complexity. If you're seeing the failure modes described at the start of this article — chunk boundary problems, exact-match failures, poor ranking — now you know where to look.
Three techniques. Smarter chunking keeps meaning intact. Hybrid search catches what pure vector misses. Reranking orders results by true relevance. Layer them as needed, based on what your data tells you.
The teams that build great RAG systems aren't the ones who implement everything at once. They're the ones who measure, identify bottlenecks, and optimize precisely where it matters.
I lead data & AI for New Zealand's largest insurer. Before that, 10+ years building enterprise software. I write about AI for people who need to finish things, not just play with tools.