
Every RAG vs long-context article ends with "it depends." This one gives you the specific thresholds to make the decision yourself.
Most comparison articles list ten or fifteen factors to consider. That's overwhelming and not particularly helpful. In practice, I've found five criteria drive the decision, and they have clear thresholds that flip the answer from one approach to the other.
The five criteria:

1. Data volume
2. Update frequency
3. Latency requirements
4. Query volume
5. Accuracy needs
These aren't equal. Data volume is the first filter. If your data doesn't fit in a context window, the decision is already made. The others help you optimise within the viable options.
Long-context models have hard limits. Gemini 2.5 Pro's 1 million token window sounds large until you do the maths: that's roughly 3,400 pages of text. A single annual report from a public company runs 300-400 pages. You could fit maybe ten of those. Real enterprise knowledge bases are measured in terabytes, not pages.
The threshold: If you have fewer than 100 documents AND fewer than 100,000 total tokens, long-context is viable. Beyond that, you need RAG.
To estimate your token count: most English text runs about 4 characters per token. A typical document page is ~500 words or ~2,000 characters, which means roughly 500 tokens. So 100K tokens is about 200 pages.
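Those rules of thumb reduce to a few lines of arithmetic. A quick sketch (the constants are the estimates above; the function names are mine):

```python
# Rough token estimates from the rules of thumb above:
# ~4 characters per token, ~500 tokens per page.
CHARS_PER_TOKEN = 4
TOKENS_PER_PAGE = 500

def estimate_tokens(char_count: int) -> int:
    """Estimate token count from raw character count."""
    return char_count // CHARS_PER_TOKEN

def fits_long_context(num_docs: int, total_chars: int,
                      doc_limit: int = 100, token_limit: int = 100_000) -> bool:
    """Apply the first filter: fewer than 100 docs AND fewer than 100K tokens."""
    return num_docs < doc_limit and estimate_tokens(total_chars) < token_limit

# 50 documents averaging 10 pages (~20,000 chars) each:
total_chars = 50 * 10 * 2_000
print(estimate_tokens(total_chars))        # 250000 tokens
print(fits_long_context(50, total_chars))  # False: over the token limit
```

Fifty medium-sized documents already blow past the 100K-token threshold, which is why volume is the first filter.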
This is the first filter because it's binary. If your data doesn't fit, the other criteria don't matter yet. But if it does fit, keep going — volume alone doesn't determine the best approach.
Long-context's main advantage is simplicity. You dump your documents into the prompt and let the model figure it out. No embeddings, no vector databases, no chunking strategies to debug.
That advantage disappears the moment your data changes frequently.
With RAG, you can update your index instantly: a new document gets embedded and becomes searchable in seconds. With long-context, every change means reconstructing your entire prompt. If you're processing support tickets, news feeds, or any data that updates daily, you'll spend more time rebuilding your context than the simplicity ever saved you.
The threshold: If your data updates daily or more frequently, use RAG. If your data is static or changes quarterly at most, long-context remains viable.
Consider what "updates" means for your use case. A legal document repository that adds new contracts weekly is different from a compliance database that needs real-time regulatory changes. The former could work with either approach; the latter needs RAG.
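To see why updates are cheap with RAG, here's a toy in-memory index. The word-count "embedding" is a stand-in assumption; a real system would call an embedding model and a vector database, but the shape is the same: adding a document touches only the index, never a giant prompt.

```python
from collections import Counter
import math

class TinyIndex:
    """Toy vector index: word-count vectors stand in for real embeddings."""
    def __init__(self):
        self.docs: list[tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        # A new document becomes searchable immediately.
        self.docs.append((text, Counter(text.lower().split())))

    def search(self, query: str, k: int = 1) -> list[str]:
        q = Counter(query.lower().split())
        def score(vec: Counter) -> float:
            dot = sum(q[w] * vec[w] for w in q)
            norm = math.sqrt(sum(v * v for v in q.values())) * \
                   math.sqrt(sum(v * v for v in vec.values()) or 1)
            return dot / norm if norm else 0.0
        ranked = sorted(self.docs, key=lambda d: score(d[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

index = TinyIndex()
index.add("Refund policy: refund within 30 days of purchase.")
index.add("Shipping policy: orders ship within 2 business days.")
print(index.search("refund window"))  # the refund document ranks first
```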
Here's a number that doesn't get enough attention: in testing by Elastic Labs (who, yes, sell vector search infrastructure, but their methodology was solid), RAG queries averaged 1 second while long-context queries averaged 45 seconds.
That's not a typo. 45 seconds.
The gap exists because processing more tokens takes more time. With RAG, you retrieve the relevant chunks and send maybe 4,000-8,000 tokens to the model. With long-context, you're sending hundreds of thousands of tokens every single time.
The threshold: If this is user-facing (a chat interface, a search feature, anything where a human is waiting), use RAG. If it's batch processing or one-off analysis where you can wait a minute for each query, long-context can work.
Latency also scales non-linearly with context length. At 100K tokens, you might see 10-20 second responses. At 500K tokens, you're looking at 30-60 seconds. At the full million? Some users report response times exceeding a minute.
The threshold: If you're running more than 1,000 queries per day, RAG's cost advantage becomes significant. If you have low volume and high stakes per query (say, a weekly executive briefing on market conditions), long-context's premium might be worth the simplicity.
There's also the infrastructure cost to consider. Running your own long-context model at 1M tokens requires approximately 40 A10 GPUs for a single user. Most teams don't have that lying around. API-based pricing insulates you from this, but someone is paying for those GPUs.
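A back-of-envelope calculation makes the gap visible. The price constant below is a placeholder, not a real vendor rate; the point is that cost scales with tokens sent, and the two approaches differ by orders of magnitude in token volume:

```python
# Placeholder price in USD per 1M input tokens -- plug in your own rate.
PRICE_PER_M_TOKENS = 1.0

def daily_cost(queries_per_day: int, tokens_per_query: int) -> float:
    """Daily input-token spend at a flat per-token price."""
    return queries_per_day * tokens_per_query * PRICE_PER_M_TOKENS / 1_000_000

rag = daily_cost(1_000, 6_000)         # ~6K retrieved tokens per query
long_ctx = daily_cost(1_000, 500_000)  # half the window, every single query
print(f"RAG: ${rag:.2f}/day, long-context: ${long_ctx:.2f}/day")
```

At identical per-token pricing, the gap here is roughly 80x, driven entirely by how many tokens each query sends.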
This is where it gets nuanced. Benchmarks show long-context outperforming RAG overall: 56.3% correct answers versus 49.0% in one comprehensive study. Long-context wins because it can see relationships across the entire document set that RAG might miss.
But there's a catch called "lost in the middle."
Researchers at Stanford and UC Berkeley (Liu et al., 2023) found that LLMs perform best when relevant information is at the beginning or end of the context. When key information sits in the middle, accuracy drops by 30% or more. The model's attention follows a U-shaped curve, focusing on the edges and losing track of the centre.
The threshold: If you need guaranteed accuracy for specific facts — regulatory requirements, contract clauses, technical specifications — use RAG with reranking. RAG retrieves the most relevant chunks and puts them front-and-centre where the model pays attention. If you need to understand how ideas connect across a document set, long-context is better despite the mid-context weakness.
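One way to act on that U-shaped attention curve is to place the highest-scoring reranked chunks at the edges of the context and push the weakest toward the middle. A sketch (the ordering strategy and function name are mine, not from the cited paper):

```python
def order_for_context(chunks: list[tuple[float, str]]) -> list[str]:
    """Arrange (score, text) chunks so the best land at the context's edges,
    where attention is strongest, and the weakest sit in the middle."""
    ranked = sorted(chunks, key=lambda c: c[0], reverse=True)
    front, back = [], []
    for i, (_, text) in enumerate(ranked):
        # Alternate: 1st-best to the front, 2nd-best to the back, and so on.
        (front if i % 2 == 0 else back).append(text)
    return front + back[::-1]

chunks = [(0.9, "A"), (0.5, "C"), (0.7, "B"), (0.3, "D")]
print(order_for_context(chunks))  # ['A', 'C', 'D', 'B']
```

The best chunk opens the context and the second-best closes it, so the model's strongest attention regions always hold the most relevant material.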
One interesting finding: using just 16,000 carefully selected tokens with RAG scored 44.43 F1, while dumping all 128,000 tokens into the context scored only 34.32. More tokens isn't always better.
Here's how this comes together as a flowchart:
1. **How much data?** More than 100 documents or 100K tokens → RAG. Otherwise, continue.
2. **How often does it change?** Daily or more → RAG. Otherwise, continue.
3. **What's your latency requirement?** User-facing → RAG. Batch or async → continue.
4. **What's your query volume?** More than 1,000 queries/day → RAG. Otherwise, continue.
5. **What accuracy do you need?** Precision on specific facts → RAG. Big-picture synthesis → long-context.
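The five filters can also be written as a single decision function, with thresholds taken straight from this article (the function itself is my sketch):

```python
def choose_approach(num_docs: int, total_tokens: int,
                    updates_daily: bool, user_facing: bool,
                    queries_per_day: int, needs_fact_precision: bool) -> str:
    """Apply the five filters in order; any tripped filter points to RAG."""
    if num_docs >= 100 or total_tokens >= 100_000:
        return "rag"              # Filter 1: data doesn't fit the window
    if updates_daily or user_facing or queries_per_day > 1_000:
        return "rag"              # Filters 2-4: freshness, latency, volume
    if needs_fact_precision:
        return "rag"              # Filter 5: guaranteed accuracy on facts
    return "long-context"         # small, static, batch, big-picture

print(choose_approach(50, 80_000, False, False, 100, False))  # long-context
```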
Here's the quick reference:
| Criterion | Use RAG | Use Long-Context |
|---|---|---|
| Data volume | >100 docs or >100K tokens | <100 docs AND <100K tokens |
| Update frequency | Daily or more | Static or quarterly |
| Latency | User-facing | Batch/async |
| Query volume | >1,000/day | Low volume |
| Accuracy | Precision on facts | Big-picture understanding |
If you made it through all five filters and the answer isn't clear, you probably need both.
The hybrid approach isn't just splitting the difference. It's using each approach for what it does best. One pattern that's worked well: use RAG for initial retrieval to find the most relevant documents, then use long-context to synthesise across those retrieved documents.
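That retrieve-then-synthesise pattern looks roughly like this. Both `retrieve` and `llm_call` are stubs standing in for a real vector store and a real long-context model:

```python
def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    # Stub: naive keyword overlap stands in for vector search + reranking.
    scored = sorted(corpus,
                    key=lambda d: len(set(query.lower().split())
                                      & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def llm_call(prompt: str) -> str:
    # Stub: a real system would send this prompt to a long-context model.
    return f"[synthesis over {prompt.count('---') + 1} documents]"

def hybrid_answer(query: str, corpus: list[str]) -> str:
    docs = retrieve(query, corpus)               # RAG narrows the field...
    context = "\n---\n".join(docs)               # ...then the survivors go
    return llm_call(f"{context}\n\nQ: {query}")  # to long-context synthesis
```

RAG does what it's good at (finding the right documents fast), and the long-context model does what it's good at (connecting ideas across them).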
Research on "Self-Route" approaches shows promise here. The model evaluates each query and routes it to RAG or long-context based on what the query needs. Factual lookups go to RAG. Synthesis questions go to long-context. This maintains accuracy while managing costs.
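In the Self-Route research the model judges each query itself; here a cheap keyword heuristic stands in to show the shape of the router (the cue list is my assumption, not from the paper):

```python
# Cues suggesting a query needs cross-document synthesis rather than lookup.
SYNTHESIS_CUES = ("summarise", "compare", "evolved", "overall", "trend")

def route(query: str) -> str:
    """Send synthesis questions to long-context, factual lookups to RAG."""
    q = query.lower()
    if any(cue in q for cue in SYNTHESIS_CUES):
        return "long-context"
    return "rag"

print(route("What is the refund window for enterprise plans?"))  # rag
print(route("Summarise how our refund policy has evolved"))      # long-context
```

In production you'd replace the heuristic with a lightweight classifier or a cheap model call, but the routing structure stays the same.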
Databricks research confirms that longer context windows actually make RAG better, not obsolete. More context means RAG can include more retrieved documents without truncation. The approaches are synergistic, not competing.
When hybrid makes sense: Medium complexity projects where you need both precision and synthesis. Enterprise knowledge management where some queries are "find the policy on X" (RAG) and others are "summarise how our approach to X has evolved" (long-context).
So is RAG dead? Short answer: no. But basic RAG is dying.
The pattern of "embed chunks → vector search → dump results into context" is increasingly inadequate for complex queries. Fair enough, that deserves to die. What's replacing it isn't long-context alone. It's a broader discipline called "context engineering."
Context engineering treats what goes into the model's context as something to be optimised, not just filled. Retrieval is one tool. Summarisation is another. Prompt compression, strategic ordering, semantic chunking — all of these matter.
The numbers tell the story: RAG framework usage surged 400% since 2024. 60% of production LLM applications use some form of retrieval. Companies aren't abandoning RAG for long-context; they're making RAG smarter.
The real question isn't "RAG or long-context?" It's "how do I get the right information into the context, formatted in a way the model can use effectively?"
You now have the framework. But frameworks only matter if you apply them.
Start with two numbers: your data volume and your update frequency. Those two criteria alone will narrow your options to one or two approaches. Then factor in latency, cost, and accuracy needs to make the final call.
The answer to "RAG vs long-context" isn't "it depends." It's "here's what it depends on, and here's how to evaluate your situation." You have the thresholds. Now you can stop reading comparison articles and start building.
I lead data & AI for New Zealand's largest insurer. Before that, I spent 10+ years building enterprise software. I write about AI for people who need to finish things, not just play with tools.