
Microsoft just told thousands of engineers to install Claude Code.
Not as a replacement for Copilot. Alongside it. They want their own engineers to run head-to-head comparisons between the AI coding tool they sell and the one Anthropic built.
Think about what that means for a second.
The company that owns GitHub Copilot - the company that's invested billions in OpenAI (and, to be fair, has invested a lot in Anthropic too) - is so concerned about Claude Code that they need internal data on how it stacks up. The Verge broke the story: Microsoft's Experiences + Devices division (that's Windows, Office, Teams, Edge, and Surface) was asked to install Claude Code last week. Even designers and project managers are being encouraged to prototype with it.
When you're running internal benchmarks against a competitor's product, you're not confident you're winning.
I build with Claude Code every day. I'm developing an AI-powered notebook app called Onsomble - a Next.js frontend, NestJS backend monorepo with LangGraph workflows and RAG capabilities. Claude Code isn't just another tool in my stack. It's become the way I build software.
And I get why Microsoft needs to know what they're up against.
Here's the thing nobody wants to say out loud: Anthropic's models are better for serious coding work. Not marginally better. Noticeably better.
Even in competitor tools. Fire up Cursor or Windsurf, and watch what model serious engineers choose. It's Sonnet or Opus. Not because of brand loyalty - because of results.
The numbers back this up. Claude Opus 4.5 hit 80.9% on SWE-bench Verified - the first AI to break 80%, currently the world leader. But benchmarks only tell part of the story.
When I started building Onsomble, I tried every model. GPT-4, Gemini, the works. I kept coming back to Anthropic's models for anything non-trivial. The difference isn't marginal. It's the difference between code that "sort of works" and code that actually understands your architecture.
Stack Overflow's 2025 Developer Survey found Claude Sonnet is used more by professional developers (45%) than by those learning to code (30%). The professionals know.
Here's where it gets uncomfortable for the Copilot team.
A Blind survey from December 2025 asked tech professionals which AI tool they actually use. At Microsoft, 34% of respondents said Claude was their primary tool. Copilot? 32%.
It's not just Microsoft. At Meta, 50% reported Claude as their most-used AI model. Only 8% said Meta AI. At Amazon, 54% chose Claude as their go-to.
Engineers vote with their keystrokes. And right now, they're voting for Claude.
This is why Microsoft is running internal comparisons. They've seen the survey data. They know their engineers are already using Claude Code on side projects, maybe on company time. The smart move isn't to ban it - it's to understand exactly where Copilot falls short.
Most AI coding tools are great at answering questions about individual files. Ask about a function, get a reasonable answer. But codebases aren't individual files. They're systems.
Claude Code understands systems.
Onsomble is a monorepo with a Next.js frontend and NestJS backend. Change a DTO in the backend, and it affects API calls in the frontend, which affects state management in Zustand stores, which affects how components render. Most tools would give me answers for individual files. Claude Code gives me answers for my system.
This isn't magic. It's context management done right.
The CLAUDE.md system lets you give Claude project-specific context - your architecture decisions, your conventions, your common pitfalls. In a monorepo, you can nest these files: one at the root, one in each major directory. Claude reads them, understands the relationships, and traces dependencies across boundaries.
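To give a feel for it, here's a trimmed sketch of what a root-level CLAUDE.md for a monorepo like this might contain - the paths and rules below are illustrative, not lifted from Onsomble:

```markdown
# CLAUDE.md (repo root)

## Architecture
- apps/web: Next.js frontend (App Router, Zustand for client state)
- apps/api: NestJS backend (REST controllers, class-validator DTOs, LangGraph workflows)
- packages/shared: request/response types shared by both sides

## Conventions
- API payload shapes are defined once as DTOs in apps/api and mirrored in packages/shared.
  Never hand-write a payload type in the frontend.
- Client state lives in Zustand stores under apps/web/stores; components stay presentational.

## Common pitfalls
- Changing a DTO without updating packages/shared breaks the frontend silently.
- The global ValidationPipe strips unknown properties, so renamed fields fail at runtime, not at compile time.
```

The nested CLAUDE.md files in each app then only carry the conventions specific to that side of the codebase.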
When I ask Claude Code why a certain API call is failing, it doesn't just look at the endpoint. It traces the DTO definition, checks how the frontend is constructing the request, examines the validation pipe, and tells me exactly where the mismatch is.
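To make that concrete, here's the shape of a mismatch like the one I'm describing - the names and files are invented for illustration, not Onsomble's actual code:

```typescript
// apps/api/src/notes/dto/create-note.dto.ts (illustrative)
import { IsString, IsUUID } from 'class-validator';

export class CreateNoteDto {
  @IsUUID()
  notebookId: string; // renamed from notebook_id in a backend refactor

  @IsString()
  title: string;
}

// apps/web/lib/api.ts (illustrative)
// The frontend still sends the old snake_case key. With a global ValidationPipe
// configured with whitelist + forbidNonWhitelisted, the request comes back as a
// 400 even though every individual file looks fine on its own.
export async function createNote(notebookId: string, title: string) {
  return fetch('/api/notes', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ notebook_id: notebookId, title }), // <- the mismatch
  });
}
```

No single file is wrong here; the bug lives in the relationship between them. That's the kind of cross-boundary trace a file-at-a-time tool misses.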
I spent three months with other tools before switching to Claude Code. The difference in how it navigates complex codebases is stark. It's not thinking about files - it's thinking about systems.
Here's one killer feature that's massively under-utilised: you can teach Claude Code to think like you.
Claude Skills let you encode your workflows, your debugging approaches, your architectural standards. This isn't just "custom prompts." It's creating specialized agents within Claude Code that follow your exact methodology.
I built a skill for bug investigation. Here's how it works:
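At its core, a skill is a SKILL.md file: a few lines of frontmatter telling Claude when to pull it in, followed by the steps you want followed every time. A trimmed, genericised sketch of mine looks something like this:

```markdown
---
name: bug-investigation
description: Use when investigating a reported bug or a failing behaviour in the app
---

# Bug investigation

1. Reproduce first. Get the exact steps, payload, or failing test before touching code.
2. Trace the full path: frontend call -> DTO -> controller -> service -> persistence.
3. List every file on that path before proposing any change.
4. State one hypothesis explicitly and confirm it with a log or a test.
5. Only then propose the smallest fix, plus a regression test that would have caught it.
```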
This isn't autocomplete. This is pair programming with someone who never forgets your architecture and never gets tired of being systematic.
I have skills for creating components that follow our atomic design system. Skills for writing tests that match our patterns. Skills for planning features with the right level of breakdown.
Over time, Claude Code has become less of a tool and more of a team member that's internalized how we work.
Most engineers I talk to haven't even discovered Skills yet. They're using Claude Code like a better autocomplete. That's like buying a Tesla and only using it for the cup holders.
Microsoft's internal benchmark is a preview of what happens when executives finally pay attention to what their engineers are already telling them.
The engineers found something better and started using it. Leadership's response wasn't to ban the competition - it was to study it. That's the right call. The "let's compare them head-to-head" approach is far smarter than pretending the problem doesn't exist.
The real cost isn't the subscription price - though Anthropic's API isn't cheap. The real cost is the productivity delta. Claude Opus 4.5 handles long-horizon coding tasks using up to 65% fewer tokens than previous models while achieving higher pass rates. Those are efficiency gains that compound across your entire engineering org.
If your team is gravitating toward a tool you don't sell, that's not betrayal. That's market research delivered directly to your doorstep.
Microsoft understood this. They didn't issue a ban. They issued a benchmarking exercise. Somewhere in Redmond right now, an engineer is filing a report on exactly where Claude Code outperforms Copilot. That report is going to be uncomfortable reading.
Microsoft didn't deploy Claude Code because they think Copilot is winning. They deployed it because they need to know how far behind they are.
I've been building with Claude Code for months now. Every week it gets more embedded in how I work. The model quality keeps improving. The context handling keeps getting smarter. The Skills system lets me compound my workflows over time.
Microsoft's internal test will generate data. But the Blind survey already told us what the engineers think. 34% to 32%. The verdict is in.
The best tool is winning.
I lead data & AI for New Zealand's largest insurer. Before that, 10+ years building enterprise software. I write about AI for people who need to finish things, not just play with tools.
