Promising · Models & Platforms · No change · March 2026 Backfill

Strong signal and real results. Worth committing a pilot to.

Reasoning Models

Step-change improvements on hard problems, but overthinking inflates costs 5-8x — match model capability to task difficulty instead of using reasoning everywhere.

LLM · Infrastructure

openai.com


What It Is

Reasoning models spend extra compute at inference time to "think" through problems step by step before answering. OpenAI o3, DeepSeek R1-0528, xAI Grok 4.20 Beta, and Amazon Nova 2 all ship with this capability. Some offer developer controls for thinking effort — you can dial up reasoning for hard problems and dial it down for simple ones.
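The developer controls mentioned above can be sketched as a request payload. This is a minimal illustration only: the `reasoning_effort` field mirrors the parameter OpenAI exposes for its o-series models, but other providers name the control differently, so treat the exact key and the model name as assumptions and check your provider's documentation.

```python
# Hedged sketch: dialing per-request reasoning effort up or down.
# The `reasoning_effort` key follows OpenAI's o-series API; other
# providers (e.g. Amazon Nova 2) expose similar but differently
# named controls.

def build_request(prompt: str, hard: bool) -> dict:
    """Return a chat-completion payload, spending more thinking on hard tasks."""
    return {
        "model": "o3",  # any reasoning-capable model
        "reasoning_effort": "high" if hard else "low",
        "messages": [{"role": "user", "content": prompt}],
    }

easy = build_request("What is 2 + 2?", hard=False)
hard = build_request("Prove the claim holds for all n.", hard=True)
```

The point is that effort is a per-request knob, not a per-deployment one: the same model serves both calls, but only the hard one pays for a long reasoning chain.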

Why It Matters

Reasoning models are Promising because they deliver genuine step-change improvements on hard benchmarks (o3 sets SOTA on AIME, Codeforces, SWE-bench), but the cost-quality trade-off is still being worked out. The 5-8x cost multiplier from reasoning chains means using these models for simple tasks wastes compute with no quality gain.

The overthinking problem is real. Research shows reasoning models overthink 3x more often than non-reasoning models, and each unit increase in overthinking correlates with a 7.9% drop in task resolution. The right approach is to match model capability to task difficulty, not to use reasoning everywhere.
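One way to act on "match capability to difficulty" is a cheap router in front of the model call. The heuristic and model names below are illustrative assumptions, not a production classifier; the idea is simply to reserve the 5-8x cost of reasoning chains for tasks that plausibly need them.

```python
# Minimal sketch of capability-to-difficulty routing. Keyword hints and
# the length threshold are placeholder heuristics; a real system might
# use a small classifier model instead.

HARD_HINTS = ("prove", "debug", "multi-step", "optimize", "why does")

def pick_model(task: str) -> str:
    """Route hard-looking tasks to a reasoning model, the rest to a cheaper one."""
    text = task.lower()
    if any(hint in text for hint in HARD_HINTS) or len(text.split()) > 80:
        return "reasoning-model"   # e.g. o3, R1: slow, expensive, strong
    return "standard-model"        # non-reasoning chat model: fast, cheap

route = pick_model("Prove this identity holds for all n")  # "reasoning-model"
```

Even a crude router like this avoids paying reasoning-chain prices on the large fraction of traffic that is simple lookups and rewrites.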

Key Developments

  • Mar 2026: xAI Grok 4.20 Beta with 78% non-hallucination rate (industry record per Artificial Analysis).
  • Dec 2025: Amazon Nova 2 with developer-controllable thinking effort.
  • Nov 2025: Grok 4.1 claimed #1 on LMArena Elo ranking (1483) and EQ-Bench.
  • Ongoing: Inference-time scaling established as dominant research direction for 2026.

What to Watch

Controllable thinking budgets are the feature to track. If developers can reliably specify "spend X tokens thinking about this" and get predictable cost-quality trade-offs, reasoning models become practical for production. The category leader changes every few months (o3, Grok 4.1, Grok 4.20) — don't build infrastructure around any single model.
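The "spend X tokens thinking" idea translates directly into a cost ceiling you can compute before sending a request. In the sketch below, `budget_tokens` echoes the parameter name Anthropic uses for extended thinking, but treat the request shape and the pricing figure as placeholder assumptions, not real rates.

```python
# Sketch of a controllable thinking budget: cap reasoning tokens per
# request and derive a predictable worst-case cost. Pricing is a
# placeholder ($0.01 per 1k tokens), not any provider's real rate.

def max_cost_usd(budget_tokens: int, output_tokens: int,
                 price_per_1k: float = 0.01) -> float:
    """Worst-case spend if the model exhausts its whole thinking budget."""
    return (budget_tokens + output_tokens) / 1000 * price_per_1k

request = {
    "model": "reasoning-model",
    "thinking": {"type": "enabled", "budget_tokens": 4096},
    "max_tokens": 1024,
}
cap = max_cost_usd(4096, 1024)  # known ceiling before the call is made
```

This is what makes budgets the feature to watch: with a hard cap, cost becomes a bounded function of the request rather than a property of how long the model decides to think.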

Strengths

  • Step-change on hard benchmarks: o3 sets SOTA on AIME, Codeforces, SWE-bench. 20% fewer major errors than o1 in external evaluations.
  • Open-source options competitive: DeepSeek R1 and QwQ-32B match or approach frontier proprietary models.
  • Controllable thinking budgets: Amazon Nova 2 and OpenAI offer developer controls for reasoning effort per request.
  • Every major lab participating: OpenAI, Google, Anthropic, DeepSeek, xAI, Amazon, Alibaba — broad ecosystem reduces vendor risk.

Considerations

  • Cost and latency multiplier: Reasoning chains inflate token usage 5-8x. Using reasoning for simple tasks wastes compute with no quality gain.
  • Overthinking problem: Reasoning models overthink 3x more often than non-reasoning models, and each unit increase in overthinking correlates with a 7.9% drop in task resolution.
  • Benchmark saturation vs real-world gaps: Excel at contest-style problems but improvements on messy, ambiguous tasks are more modest.
  • Rapidly evolving landscape: Category leader changes every few months. Building infrastructure around any single model is risky.
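The 5-8x inflation figure is easy to sanity-check with back-of-envelope arithmetic. The token counts below are illustrative assumptions (not measured numbers): if a plain answer runs ~500 output tokens and the hidden reasoning chain adds 2,000-3,500 more billed tokens, total output cost lands in the 5-8x range.

```python
# Back-of-envelope check on the 5-8x cost-inflation claim.
# Token counts are illustrative assumptions, not measurements.

def inflation(answer_tokens: int, thinking_tokens: int) -> float:
    """Ratio of billed output tokens with vs. without a reasoning chain."""
    return (answer_tokens + thinking_tokens) / answer_tokens

low = inflation(500, 2000)   # 5.0x
high = inflation(500, 3500)  # 8.0x
```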