Proven · Agents & Orchestration · No change · March 2026 Backfill

Battle-tested in production. Build on it with confidence.

Chain-of-Thought

Our Take

Essential for complex reasoning, but returns diminish on modern reasoning models; the token cost and latency hit mean you should use it selectively, not by default.

What It Is

Chain-of-thought (CoT) prompting guides LLMs to show their working before answering. Originally introduced by Wei et al. (NeurIPS 2022), it demonstrated dramatic improvements on math, logic, and multi-step problems. The technique has since been absorbed into model architecture itself — OpenAI's o-series models and Claude's extended thinking use CoT internally by default.
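
In prompt form, the technique is just a trigger phrase appended to the question (zero-shot) or a worked exemplar prepended to it (few-shot). A minimal sketch; the helper name and exemplar here are illustrative, not from any specific library:

```python
# Minimal chain-of-thought prompt construction. The trigger phrase and
# exemplar style follow the patterns from the original CoT research;
# build_cot_prompt is a hypothetical helper, not a library function.

COT_TRIGGER = "Let's think step by step."

FEW_SHOT_EXEMPLAR = (
    "Q: A jug holds 4 liters. You pour out 1.5 liters twice. How much is left?\n"
    "A: The jug starts with 4 liters. Pouring 1.5 liters twice removes "
    "3 liters. 4 - 3 = 1 liter. The answer is 1 liter.\n\n"
)

def build_cot_prompt(question: str, few_shot: bool = False) -> str:
    """Return a CoT-style prompt: few-shot with a worked exemplar,
    or zero-shot with the 'think step by step' trigger."""
    if few_shot:
        return f"{FEW_SHOT_EXEMPLAR}Q: {question}\nA:"
    return f"Q: {question}\nA: {COT_TRIGGER}"

print(build_cot_prompt("If I have 3 boxes of 12 eggs and break 5, how many are intact?"))
```

The exemplar shows the model the shape of the answer it should produce: intermediate steps first, conclusion last.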

Why It Matters

CoT sits at Proven because it's become a foundational concept every practitioner needs to understand, even if explicit CoT prompting is becoming less necessary. The key insight from Wharton's February 2026 study: CoT adds only 2.9-3.1% improvement on reasoning models like o3-mini and o4-mini. For models with built-in reasoning, adding "let's think step by step" is paying for the same work twice.

The practical upshot: use CoT deliberately. For legacy models or tasks requiring interpretable reasoning traces, it's still valuable. For frontier reasoning models, your tokens are better spent elsewhere.
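
"Use CoT deliberately" can be as simple as gating the trigger phrase on the target model. A sketch, with the caveat that the model list below is an illustrative placeholder, not an authoritative registry:

```python
# Gate the CoT trigger on whether the model already reasons internally.
# The membership set is a stand-in; maintain your own based on the
# models your application actually calls.

REASONING_MODELS = {"o3-mini", "o4-mini"}  # built-in reasoning: skip the trigger

def prepare_prompt(question: str, model: str) -> str:
    if model in REASONING_MODELS:
        # Adding the trigger here would pay for the same work twice.
        return question
    return f"{question}\n\nLet's think step by step."

print(prepare_prompt("What is 17 * 23?", "o4-mini"))
print(prepare_prompt("What is 17 * 23?", "gpt-4o"))
```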

Key Developments

  • Feb 2026: Wharton study shows CoT adds only 2.9-3.1% improvement on reasoning models (o3-mini, o4-mini).
  • Jan 2026: AWS publishes Chain-of-Draft on Amazon Bedrock — a more token-efficient alternative to CoT.
  • Late 2025: Dynamic Recursive CoT (DR-CoT) published in Nature Scientific Reports with voting mechanism.
  • 2025-2026: Multimodal CoT expansion with "Image of Thought" framework for visual reasoning.
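
The voting idea in DR-CoT belongs to the same family as self-consistency decoding: sample several independent reasoning chains and majority-vote their final answers. A toy sketch with stubbed chain outputs (a real setup would sample the model at nonzero temperature and extract each chain's final answer):

```python
# Majority voting over final answers from multiple sampled chains.
# The sampled answers below are hard-coded stand-ins for model output.
from collections import Counter

def majority_vote(final_answers: list[str]) -> str:
    """Return the most common final answer across sampled chains."""
    return Counter(final_answers).most_common(1)[0][0]

sampled = ["31", "31", "29"]  # three chains, two agree
print(majority_vote(sampled))
```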

What to Watch

Chain-of-Draft and other token-efficient alternatives are the signal. If these approaches deliver comparable accuracy at a fraction of the token cost, explicit CoT becomes a historical technique rather than a current best practice. Watch for reasoning models that let you control thinking depth per request — Amazon Nova 2 and OpenAI already offer this.
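
The difference between the two styles is visible in the instruction itself: Chain-of-Draft caps each reasoning step at a few words rather than a full sentence. A sketch; the CoD wording below paraphrases the published technique, and exact phrasing varies by implementation:

```python
# Instruction-level comparison: verbose CoT vs terse Chain-of-Draft.
# Both strings are illustrative paraphrases, not canonical prompts.

COT_INSTRUCTION = "Think step by step and explain your reasoning in full."
COD_INSTRUCTION = (
    "Think step by step, but keep only a minimal draft for each step, "
    "at most five words per step."
)

def draft_prompt(question: str, style: str = "cod") -> str:
    instruction = COD_INSTRUCTION if style == "cod" else COT_INSTRUCTION
    return f"{instruction}\n\nQ: {question}\nA:"

print(draft_prompt("A train travels 120 km in 1.5 hours. Average speed?"))
```

Same multi-step structure, far fewer generated tokens per step, which is exactly the trade Chain-of-Draft is betting on.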

Strengths

  • Proven accuracy gains: Dramatic improvements on GSM8K, arithmetic, and commonsense reasoning benchmarks in the original research.
  • Zero-shot applicability: Adding "let's think step by step" improves reasoning without requiring examples.
  • Embedded in frontier models: OpenAI's o-series and Claude's extended thinking have made CoT an architectural feature.
  • Interpretability: Explicit reasoning traces let developers verify the model's logic path, aiding debugging and trust.
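
The interpretability benefit is easiest to exploit when the prompt asks for a fixed closing phrase, so the trace and the answer can be separated mechanically. A sketch, assuming the prompt instructed the model to end with "The answer is ...":

```python
# Split a CoT response into its reasoning trace and final answer so the
# logic path can be logged and inspected. Assumes a "The answer is ..."
# closing convention was requested in the prompt.

def split_trace(response: str) -> tuple[str, str]:
    marker = "The answer is"
    idx = response.rfind(marker)
    if idx == -1:
        return response.strip(), ""  # no explicit final answer found
    reasoning = response[:idx].strip()
    answer = response[idx + len(marker):].strip(" .")
    return reasoning, answer

resp = "3 boxes of 12 is 36 eggs. 36 - 5 = 31. The answer is 31."
reasoning, answer = split_trace(resp)
print(answer)
```

Logging the `reasoning` half alongside the answer gives you the debuggable trace the bullet above describes.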

Considerations

  • Token cost multiplier: CoT increases token consumption 2-4x compared to direct answering. With inference accounting for 70-90% of LLM spend, that multiplier adds up.
  • Diminishing returns on reasoning models: Only 2.9-3.1% improvement on o3-mini/o4-mini. For models with built-in reasoning, explicit CoT adds cost with minimal benefit.
  • Latency penalty: Responses take 35-600% longer. Not suitable for real-time or low-latency applications.
  • Plausible-but-wrong reasoning: Smaller models can produce coherent chains that reach incorrect conclusions, looking more convincing than a wrong direct answer.
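
The token cost multiplier is easy to put in dollar terms. A back-of-envelope sketch; the request volume, token counts, and per-token price below are illustrative placeholders, not real provider rates:

```python
# Rough cost impact of the 2-4x CoT output-token multiplier.
# All inputs are hypothetical; substitute your own traffic and pricing.

def monthly_cost(requests: int, output_tokens: int,
                 price_per_1k: float, cot_multiplier: float = 1.0) -> float:
    """Output-token cost for a month of traffic, with an optional
    CoT inflation factor applied to tokens generated."""
    tokens = requests * output_tokens * cot_multiplier
    return tokens / 1000 * price_per_1k

base = monthly_cost(1_000_000, 200, 0.002)           # direct answering
with_cot = monthly_cost(1_000_000, 200, 0.002, 3.0)  # mid-range 3x multiplier
print(f"direct: ${base:,.0f}  with CoT: ${with_cot:,.0f}")
```

At a hypothetical 3x multiplier, the output-token bill triples with it, which is why the section above recommends reserving explicit CoT for the tasks that need it.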