
Every RAG vs long-context article ends with "it depends." This one gives you the specific thresholds to make the decision yourself.
Most comparison articles list ten or fifteen factors to consider. That's overwhelming and not particularly helpful. In practice, I've found five criteria drive the decision, and they have clear thresholds that flip the answer from one approach to the other.
The five criteria:

1. Data volume
2. Update frequency
3. Latency requirements
4. Query volume
5. Accuracy needs
These aren't equal. Data volume is the first filter. If your data doesn't fit in a context window, the decision is already made. The others help you optimise within the viable options.
Long-context models have hard limits. Gemini 2.5 Pro's 1 million token window sounds large until you do the maths: that's roughly 3,400 pages of text. A single annual report from a public company runs 300-400 pages. You could fit maybe ten of those. Real enterprise knowledge bases are measured in terabytes, not pages.
The threshold: If you have fewer than 100 documents AND fewer than 100,000 total tokens, long-context is viable. Beyond that, you need RAG.
To estimate your token count: most English text runs about 4 characters per token. A typical document page is ~500 words or ~2,000 characters, which means roughly 500 tokens. So 100K tokens is about 200 pages.
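Those rules of thumb reduce to a few lines of arithmetic. A quick sketch (the constants are the estimates above; the function names are mine):

```python
# Rough token estimates from the rules of thumb above:
# ~4 characters per token, ~500 tokens per page.
CHARS_PER_TOKEN = 4
TOKENS_PER_PAGE = 500

def estimate_tokens(char_count: int) -> int:
    """Estimate token count from raw character count."""
    return char_count // CHARS_PER_TOKEN

def fits_long_context(num_docs: int, total_chars: int,
                      doc_limit: int = 100, token_limit: int = 100_000) -> bool:
    """Apply the first filter: fewer than 100 docs AND fewer than 100K tokens."""
    return num_docs < doc_limit and estimate_tokens(total_chars) < token_limit

# 50 documents averaging 10 pages (~20,000 chars) each:
total_chars = 50 * 10 * 2_000
print(estimate_tokens(total_chars))        # 250000 tokens
print(fits_long_context(50, total_chars))  # False: over the token limit
```

Fifty medium-sized documents already blow past the 100K-token threshold, which is why volume is the first filter.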
This is the first filter because it's binary. If your data doesn't fit, the other criteria don't matter yet. But if it does fit, keep going — volume alone doesn't determine the best approach.
Long-context's main advantage is simplicity. You dump your documents into the prompt and let the model figure it out. No embeddings, no vector databases, no chunking strategies to debug.
That advantage disappears the moment your data changes frequently.
With RAG, you can update your index instantly: a new document gets embedded and becomes searchable in seconds. With long-context, every change means reconstructing your entire prompt. If you're processing support tickets, news feeds, or any data that updates daily, you'll spend more time rebuilding your context than the simplicity ever saved you.
The threshold: If your data updates daily or more frequently, use RAG. If your data is static or changes quarterly at most, long-context remains viable.
Consider what "updates" means for your use case. A legal document repository that adds new contracts weekly is different from a compliance database that needs real-time regulatory changes. The former could work with either approach; the latter needs RAG.
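To see why updates are cheap with RAG, here's a toy in-memory index. The word-count "embedding" is a stand-in assumption; a real system would call an embedding model and a vector database, but the shape is the same: adding a document touches only the index, never a giant prompt.

```python
from collections import Counter
import math

class TinyIndex:
    """Toy vector index: word-count vectors stand in for real embeddings."""
    def __init__(self):
        self.docs: list[tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        # A new document becomes searchable immediately.
        self.docs.append((text, Counter(text.lower().split())))

    def search(self, query: str, k: int = 1) -> list[str]:
        q = Counter(query.lower().split())
        def score(vec: Counter) -> float:
            dot = sum(q[w] * vec[w] for w in q)
            norm = math.sqrt(sum(v * v for v in q.values())) * \
                   math.sqrt(sum(v * v for v in vec.values()) or 1)
            return dot / norm if norm else 0.0
        ranked = sorted(self.docs, key=lambda d: score(d[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

index = TinyIndex()
index.add("Refund policy: refund within 30 days of purchase.")
index.add("Shipping policy: orders ship within 2 business days.")
print(index.search("refund window"))  # the refund document ranks first
```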
Here's a number that doesn't get enough attention: in testing by Elastic Labs (who, yes, sell vector search infrastructure, but their methodology was solid), RAG queries averaged 1 second while long-context queries averaged 45 seconds.
That's not a typo. 45 seconds.
The gap exists because processing more tokens takes more time. With RAG, you retrieve the relevant chunks and send maybe 4,000-8,000 tokens to the model. With long-context, you're sending hundreds of thousands of tokens every single time.
The threshold: If this is user-facing (a chat interface, a search feature, anything where a human is waiting), use RAG. If it's batch processing or one-off analysis where you can wait a minute for each query, long-context can work.
Latency also scales non-linearly with context length. At 100K tokens, you might see 10-20 second responses. At 500K tokens, you're looking at 30-60 seconds. At the full million? Some users report response times exceeding a minute.
The threshold: If you're running more than 1,000 queries per day, RAG's cost advantage becomes significant. If you have low volume and high stakes per query (say, a weekly executive briefing on market conditions), long-context's premium might be worth the simplicity.
There's also the infrastructure cost to consider. Running your own long-context model at 1M tokens requires approximately 40 A10 GPUs for a single user. Most teams don't have that lying around. API-based pricing insulates you from this, but someone is paying for those GPUs.
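A back-of-envelope calculation makes the gap visible. The price constant below is a placeholder, not a real vendor rate; the point is that cost scales with tokens sent, and the two approaches differ by orders of magnitude in token volume:

```python
# Placeholder price in USD per 1M input tokens -- plug in your own rate.
PRICE_PER_M_TOKENS = 1.0

def daily_cost(queries_per_day: int, tokens_per_query: int) -> float:
    """Daily input-token spend at a flat per-token price."""
    return queries_per_day * tokens_per_query * PRICE_PER_M_TOKENS / 1_000_000

rag = daily_cost(1_000, 6_000)         # ~6K retrieved tokens per query
long_ctx = daily_cost(1_000, 500_000)  # half the window, every single query
print(f"RAG: ${rag:.2f}/day, long-context: ${long_ctx:.2f}/day")
```

At identical per-token pricing, the gap here is roughly 80x, driven entirely by how many tokens each query sends.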
This is where it gets nuanced. Benchmarks show long-context outperforming RAG overall: 56.3% correct answers versus 49.0% in one comprehensive study. Long-context wins because it can see relationships across the entire document set that RAG might miss.
But there's a catch called "lost in the middle."
Researchers at Stanford and UC Berkeley (Liu et al., 2023) found that LLMs perform best when relevant information is at the beginning or end of the context. When key information sits in the middle, accuracy drops by 30% or more. The model's attention follows a U-shaped curve, focusing on the edges and losing track of the centre.
The threshold: If you need guaranteed accuracy for specific facts — regulatory requirements, contract clauses, technical specifications — use RAG with reranking. RAG retrieves the most relevant chunks and puts them front-and-centre where the model pays attention. If you need to understand how ideas connect across a document set, long-context is better despite the mid-context weakness.
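One way to act on that U-shaped attention curve is to place the highest-scoring reranked chunks at the edges of the context and push the weakest toward the middle. A sketch (the ordering strategy and function name are mine, not from the cited paper):

```python
def order_for_context(chunks: list[tuple[float, str]]) -> list[str]:
    """Arrange (score, text) chunks so the best land at the context's edges,
    where attention is strongest, and the weakest sit in the middle."""
    ranked = sorted(chunks, key=lambda c: c[0], reverse=True)
    front, back = [], []
    for i, (_, text) in enumerate(ranked):
        # Alternate: 1st-best to the front, 2nd-best to the back, and so on.
        (front if i % 2 == 0 else back).append(text)
    return front + back[::-1]

chunks = [(0.9, "A"), (0.5, "C"), (0.7, "B"), (0.3, "D")]
print(order_for_context(chunks))  # ['A', 'C', 'D', 'B']
```

The best chunk opens the context and the second-best closes it, so the model's strongest attention regions always hold the most relevant material.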
One interesting finding: using just 16,000 carefully selected tokens with RAG scored 44.43 F1, while dumping all 128,000 tokens into the context scored only 34.32. More tokens isn't always better.
Here's how this comes together as a flowchart:
1. **How much data?** More than 100 documents or 100K tokens → RAG. Otherwise, continue.
2. **How often does it change?** Daily or more → RAG. Otherwise, continue.
3. **What's your latency requirement?** User-facing → RAG. Batch or async → continue.
4. **What's your query volume?** More than 1,000 queries/day → RAG. Otherwise, continue.
5. **What accuracy do you need?** Precision on specific facts → RAG. Big-picture synthesis → long-context.
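The five filters can also be written as a single decision function, with thresholds taken straight from this article (the function itself is my sketch):

```python
def choose_approach(num_docs: int, total_tokens: int,
                    updates_daily: bool, user_facing: bool,
                    queries_per_day: int, needs_fact_precision: bool) -> str:
    """Apply the five filters in order; any tripped filter points to RAG."""
    if num_docs >= 100 or total_tokens >= 100_000:
        return "rag"              # Filter 1: data doesn't fit the window
    if updates_daily or user_facing or queries_per_day > 1_000:
        return "rag"              # Filters 2-4: freshness, latency, volume
    if needs_fact_precision:
        return "rag"              # Filter 5: guaranteed accuracy on facts
    return "long-context"         # small, static, batch, big-picture

print(choose_approach(50, 80_000, False, False, 100, False))  # long-context
```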
Here's the quick reference:
| Criterion | Use RAG | Use Long-Context |
|---|---|---|
| Data volume | >100 docs or >100K tokens | <100 docs AND <100K tokens |
| Update frequency | Daily or more | Static or quarterly |
| Latency | User-facing | Batch/async |
| Query volume | >1,000/day | Low volume |
| Accuracy | Precision on facts | Big-picture understanding |
If you made it through all five filters and the answer isn't clear, you probably need both.
The hybrid approach isn't just splitting the difference. It's using each approach for what it does best. One pattern that's worked well: use RAG for initial retrieval to find the most relevant documents, then use long-context to synthesise across those retrieved documents.
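That retrieve-then-synthesise pattern looks roughly like this. Both `retrieve` and `llm_call` are stubs standing in for a real vector store and a real long-context model:

```python
def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    # Stub: naive keyword overlap stands in for vector search + reranking.
    scored = sorted(corpus,
                    key=lambda d: len(set(query.lower().split())
                                      & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def llm_call(prompt: str) -> str:
    # Stub: a real system would send this prompt to a long-context model.
    return f"[synthesis over {prompt.count('---') + 1} documents]"

def hybrid_answer(query: str, corpus: list[str]) -> str:
    docs = retrieve(query, corpus)               # RAG narrows the field...
    context = "\n---\n".join(docs)               # ...then the survivors go
    return llm_call(f"{context}\n\nQ: {query}")  # to long-context synthesis
```

RAG does what it's good at (finding the right documents fast), and the long-context model does what it's good at (connecting ideas across them).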
Research on "Self-Route" approaches shows promise here. The model evaluates each query and routes it to RAG or long-context based on what the query needs. Factual lookups go to RAG. Synthesis questions go to long-context. This maintains accuracy while managing costs.
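In the Self-Route research the model judges each query itself; here a cheap keyword heuristic stands in to show the shape of the router (the cue list is my assumption, not from the paper):

```python
# Cues suggesting a query needs cross-document synthesis rather than lookup.
SYNTHESIS_CUES = ("summarise", "compare", "evolved", "overall", "trend")

def route(query: str) -> str:
    """Send synthesis questions to long-context, factual lookups to RAG."""
    q = query.lower()
    if any(cue in q for cue in SYNTHESIS_CUES):
        return "long-context"
    return "rag"

print(route("What is the refund window for enterprise plans?"))  # rag
print(route("Summarise how our refund policy has evolved"))      # long-context
```

In production you'd replace the heuristic with a lightweight classifier or a cheap model call, but the routing structure stays the same.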
Databricks research confirms that longer context windows actually make RAG better, not obsolete. More context means RAG can include more retrieved documents without truncation. The approaches are synergistic, not competing.
When hybrid makes sense: Medium complexity projects where you need both precision and synthesis. Enterprise knowledge management where some queries are "find the policy on X" (RAG) and others are "summarise how our approach to X has evolved" (long-context).
So is RAG dead? Short answer: no. But basic RAG is dying.
The pattern of "embed chunks → vector search → dump results into context" is increasingly inadequate for complex queries. Fair enough, that deserves to die. What's replacing it isn't long-context alone. It's a broader discipline called "context engineering."
Context engineering treats what goes into the model's context as something to be optimised, not just filled. Retrieval is one tool. Summarisation is another. Prompt compression, strategic ordering, semantic chunking — all of these matter.
The numbers tell the story: RAG framework usage surged 400% since 2024. 60% of production LLM applications use some form of retrieval. Companies aren't abandoning RAG for long-context; they're making RAG smarter.
The real question isn't "RAG or long-context?" It's "how do I get the right information into the context, formatted in a way the model can use effectively?"
You now have the framework. But frameworks only matter if you apply them.
Start with two numbers: your data volume and your update frequency. Those two criteria alone will narrow your options to one or two approaches. Then factor in latency, cost, and accuracy needs to make the final call.
The answer to "RAG vs long-context" isn't "it depends." It's "here's what it depends on, and here's how to evaluate your situation." You have the thresholds. Now you can stop reading comparison articles and start building.
I lead data & AI for New Zealand's largest insurer. Before that, I spent 10+ years building enterprise software. I write about AI for people who need to finish things, not just play with tools.