
Most RAG tutorials stop at "it works." This one shows you how to make it work well.
Your RAG system works. It retrieves documents. It generates answers. Users can ask questions and get responses grounded in your content.
But something's off.
The results are mediocre. Sometimes it returns irrelevant chunks. Sometimes it misses the exact document you know contains the answer. Sometimes the response is technically correct but pulls from the wrong section entirely.
You've built basic RAG. Now you need to make it actually good. That's a different skill set — and most tutorials stop right before this part.
This article covers three techniques that take RAG from functional to genuinely good: smarter chunking, hybrid search, and reranking. These aren't theoretical improvements. They're the optimizations that production systems actually use.
Basic RAG has predictable failure modes. Once you've seen them, you'll recognize them everywhere.
Chunk boundary failures. The answer exists in your documents, but it got split across two chunks. Neither chunk alone is relevant enough to retrieve. The user gets a partial answer or no answer at all.
Semantic-but-not-exact misses. User searches for "PostgreSQL 17 performance improvements." Vector search returns generic database performance content because it understands the semantics of "performance" but doesn't weight the exact version number. The specific document mentioning 17 never surfaces.
Mediocre ranking. The right chunk is retrieved — but it's number seven in the results, and your top_k is set to five. So close, yet completely useless.
These aren't edge cases. They're the norm once you move past demo queries. Production RAG requires addressing all three.
The first question: how do you decide where to split documents?
Your chunking strategy determines whether meaning stays intact or gets shattered across arbitrary boundaries. Get this wrong, and no amount of fancy retrieval will save you.
Fixed-size chunking splits every N tokens regardless of content. It's simple and predictable, but it cuts through sentences, paragraphs, and ideas without regard for meaning. Use it only for homogeneous, unstructured content where boundaries don't matter.
Recursive chunking works hierarchically — first by sections, then paragraphs, then sentences, then characters. It respects document structure and is the default in LangChain for good reason. Testing shows 85-90% recall at 400-512 tokens with 10-20% overlap. For most use cases, this is the pragmatic choice.
Semantic chunking groups sentences by embedding similarity. Each chunk becomes thematically coherent — a single topic, a complete thought. Testing shows up to 70% accuracy improvement over naive approaches. The tradeoff is computation: you need to embed every sentence and calculate similarity scores to find split points.
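If you want to try semantic chunking, here's a minimal sketch using LangChain's experimental SemanticChunker. It assumes an OpenAI embedding model and an OPENAI_API_KEY in your environment; the threshold setting is illustrative, not a recommendation.

```python
# Semantic-chunking sketch. Assumes: pip install langchain-experimental langchain-openai
# and OPENAI_API_KEY set in the environment.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",  # split where sentence-to-sentence similarity drops sharply
)

document_text = "..."  # your source document
chunks = splitter.create_documents([document_text])
```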
Here's how to decide:
| Your Situation | Recommended Strategy |
|---|---|
| Just starting out | Recursive, 400-512 tokens, 10-20% overlap |
| Retrieval metrics are disappointing | Try semantic chunking |
| Documents are highly structured (code, legal) | Language-specific or section-aware |
| Latency is critical | Stick with recursive |
| Accuracy is paramount | Semantic or page-level |
The honest approach: start with recursive chunking. Measure your retrieval metrics. Only optimize when data shows chunking is your bottleneck. The wrong strategy creates up to a 9% gap in recall — significant, but not worth premature optimization.
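For that starting point, here's a minimal recursive-chunking sketch with LangChain. The sizes follow the 400-512 token, 10-20% overlap guidance above; treat them as defaults to tune against your own retrieval metrics.

```python
# Recursive chunking sketch. Token-based sizing via tiktoken so chunk_size matches
# the 400-512 token guidance. Requires: pip install langchain-text-splitters tiktoken
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=450,    # tokens per chunk
    chunk_overlap=64,  # roughly 15% overlap so ideas aren't cut at chunk boundaries
)

document_text = "..."  # your source document
chunks = splitter.split_text(document_text)
```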
Pure vector search understands meaning beautifully but fails at exact matches.
Consider this scenario: a user searches "What changed in version 3.2.1?" Your documents contain "Version 3.2.1 introduced breaking changes to the API..." — exactly what they need.
Vector search returns generic API documentation. Why? Because "3.2.1" has no semantic meaning to an embedding model. It's just a string of numbers. The model understands "version" and "changes" but can't distinguish 3.2.1 from 3.1.9 or 4.0.0.
Hybrid search solves this by running two searches in parallel: vector search for semantic understanding and keyword search (typically BM25) for exact matches. The results get combined using a fusion algorithm.
How Reciprocal Rank Fusion works:
Each search produces a ranked list. For every document, RRF calculates a score: 1/(rank + k), where k is typically 60. Documents appearing in both lists get combined scores. The final ranking reflects both semantic relevance and keyword precision.
Why k=60? It's been experimentally shown to balance both retrievers without letting either dominate. Lower values favor higher-ranked documents more aggressively; higher values flatten the distribution.
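The whole algorithm fits in a few lines. Here's a minimal sketch; the document IDs and lists are illustrative.

```python
# Reciprocal Rank Fusion sketch. Inputs are two ranked lists of document IDs
# (best first); output is a single fused ranking. k=60 is the common default.
from collections import defaultdict

def rrf_fuse(vector_hits: list[str], keyword_hits: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranked in (vector_hits, keyword_hits):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (rank + k)  # score = 1 / (rank + k)
    return sorted(scores, key=scores.get, reverse=True)

# "doc_b" appears in both lists, so its scores add up and it rises to the top.
print(rrf_fuse(["doc_a", "doc_b", "doc_c"], ["doc_b", "doc_d"]))
```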
Implementation is straightforward. Weaviate has built-in hybrid search with RRF. Qdrant's Query API supports it natively. Azure AI Search offers native RRF. In LangChain, the EnsembleRetriever lets you weight keyword and semantic retrievers however you want.
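A hedged sketch of that last option. Import paths shift between LangChain versions, and this example assumes an OpenAI embedding model plus the rank_bm25 and faiss-cpu packages.

```python
# Hybrid retrieval sketch: BM25 for exact matches, FAISS for semantics,
# combined by LangChain's EnsembleRetriever.
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

docs = [
    "Version 3.2.1 introduced breaking changes to the API...",
    "General guidance on tuning API performance.",
]

bm25 = BM25Retriever.from_texts(docs)                               # keyword side
vector = FAISS.from_texts(docs, OpenAIEmbeddings()).as_retriever()  # semantic side

hybrid = EnsembleRetriever(retrievers=[bm25, vector], weights=[0.4, 0.6])
results = hybrid.invoke("What changed in version 3.2.1?")
```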
When to use hybrid search: whenever queries mix natural language with exact terms that embeddings gloss over, such as version numbers, product names, error codes, and other identifiers. Production systems using hybrid retrieval report 25% reduced token usage: more relevant chunks mean less irrelevant context sent to the LLM.
Embedding models compress entire documents into single vectors. That compression is what makes fast retrieval possible across millions of documents. But compression loses nuance.
Reranking asks a different question: what if, after fast retrieval, you did a slower but more accurate pass on just the top candidates?
The two-stage architecture:
Stage one is your retriever — fast, retrieves 50-200 candidates from potentially millions of documents. Stage two is your reranker — slower, more precise, reorders those candidates by true relevance.
The key difference is how each processes information. Bi-encoders (your embedding models) process the query and document separately, then compare the resulting vectors. Cross-encoders (rerankers) process the query and document together, seeing the pair as a unit. This lets them recognize relevance that requires understanding both simultaneously.
The numbers are significant. Research from Databricks and Pinecone shows reranking improves retrieval quality by 15-48%. Modern cross-encoders can rerank 50 documents in about 1.5 seconds. The optimal candidate set is 50-75 documents for most applications, up to 200 for comprehensive search scenarios.
Your options:
Cohere Rerank is the most popular hosted option. Their Rerank 4 model has a 32K context window (four times the previous version), supports 100+ languages, and comes in Fast and Pro tiers depending on your latency-accuracy tradeoff.
BGE-reranker is open-source and self-hostable. No API costs, full control, competitive accuracy. Good for teams with the infrastructure to run inference.
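Here's a minimal sketch of that second stage using sentence-transformers and a BGE cross-encoder. The model choice, candidate list, and top_n are illustrative.

```python
# Reranking sketch: a cross-encoder scores each (query, chunk) pair jointly,
# then we keep the best few. Requires: pip install sentence-transformers
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]

# Stage one (vector or hybrid search) returns 50-200 candidates; stage two reorders them.
candidates = [
    "Version 3.2.1 introduced breaking changes to the API...",
    "General notes on API performance tuning.",
]
print(rerank("What changed in version 3.2.1?", candidates, top_n=1))
```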
When to add reranking: when the right chunks are being retrieved but ranked too low to make your top_k, and when you can spend an extra second or two of latency to buy that precision.
These techniques form a progression, not a checklist. Don't start at the top.
Level 1: Basic RAG. Fixed or recursive chunking. Vector search only. Results go directly to the LLM. This is where everyone starts.
Level 2: Better Chunking. Switch to semantic or context-aware chunking. Tune your chunk size and overlap based on your specific documents. Expected improvement: 9-70% in recall, depending on how bad your baseline was.
Level 3: Hybrid Search. Add BM25 alongside vector search. Fuse with RRF. Expected improvement: better handling of exact-match queries, 25% reduction in tokens sent to the LLM.
Level 4: Reranking. Add a cross-encoder reranking stage. Expected improvement: 15-48% in overall retrieval quality.
Level 5: Production Pipeline. All three working together — semantic chunking, hybrid retrieval, reranking. This is what serious production systems look like.
The principle: climb one level at a time. Measure everything. Only add complexity when your metrics show you need it.
The question isn't whether these techniques work. They do. The question is whether your system needs them yet.
If your RAG returns good results most of the time and users are satisfied, you might not need Level 4 complexity. If you're seeing the failure modes described at the start of this article — chunk boundary problems, exact-match failures, poor ranking — now you know where to look.
Three techniques. Smarter chunking keeps meaning intact. Hybrid search catches what pure vector misses. Reranking orders results by true relevance. Layer them as needed, based on what your data tells you.
The teams that build great RAG systems aren't the ones who implement everything at once. They're the ones who measure, identify bottlenecks, and optimize precisely where it matters.
I lead data & AI for New Zealand's largest insurer. Before that, 10+ years building enterprise software. I write about AI for people who need to finish things, not just play with tools.