Claude Opus 4.6 just landed, and it's already generating controversy. Does it live up to expectations? I spent the last week testing its agent workflows, document processing, and reasoning to see whether the hype matches reality.
Beyond the Chatbot: A Play for the Whole Office
Anthropic isn't just aiming to be another chatbot; it's gunning for total workflow automation. Availability on major cloud platforms (Amazon Bedrock, Google Cloud's Vertex AI, and Microsoft Foundry) underlines that focus, and the company is reportedly targeting industries like finance and healthcare with automated data analysis and report generation. The headlining feature? "Agent Teams," promising a squad of AI assistants ready to tackle any project. I put it to work to see whether it could actually improve my team's efficiency.
I tasked Agent Teams with planning my next vacation, a job any human assistant could handle. I spun up a trip-planner agent and a budget agent, hoping they'd collaborate. They didn't. The budget agent rejected every one of the planner's suggestions, and after an hour all I had was a log of conflicting recommendations.
The experience felt clunky and unreliable. That hour-long test cost me nearly $80 in token fees, and because sessions are ephemeral, a dropped connection wipes out the whole team. There's potential here, but it's not ready for widespread adoption.
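To see why the session went in circles, it helps to sketch the failure mode: the budget agent's hard cap sat below every one of the planner's proposals, so no amount of back-and-forth could converge. Here's a toy reproduction in plain Python (the destinations, prices, and budget are made up for illustration; this is not Anthropic's Agent Teams API):

```python
# Purely illustrative: a toy reproduction of the coordination failure,
# not the actual Agent Teams feature. All names and numbers are invented.

def planner_agent(destinations):
    """Proposes trips one at a time, cheapest first after each rejection."""
    for dest, cost in sorted(destinations, key=lambda d: d[1]):
        yield dest, cost

def budget_agent(cost, budget):
    """Approves only proposals at or under the hard budget cap."""
    return cost <= budget

destinations = [("Tokyo", 3200), ("Lisbon", 1800), ("Reykjavik", 2400)]
budget = 1500  # the cap sits below every proposal, so nothing can pass

log = []
for dest, cost in planner_agent(destinations):
    verdict = "approved" if budget_agent(cost, budget) else "rejected"
    log.append(f"{dest} (${cost}): {verdict}")

print("\n".join(log))  # every proposal is rejected; no plan is ever produced
```

Without a shared constraint (or an agent empowered to relax one), the loop can only produce a log of rejections, which is exactly what I got.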
Under the Hood: A Million-Token Context Window That Still Misses the Point?
Anthropic is also bragging about a million-token context window. Does that massive memory actually improve the experience? I threw a 700-page PDF of NDAs at it: lengthy, technically dense, and full of precise legal language and conditional clauses. In other words, a perfect needle-in-a-haystack test.
I asked two basic questions: who's liable if things go south (indemnification), and which state's laws apply (governing law). Opus 4.6 got both wrong: it failed to identify the correct parties liable for indemnification and misidentified the governing state law. Given the expense, that's a poor showing; this was information I could have found with a simple keyword search.
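For contrast, the "simple search" baseline is trivial to script. A minimal sketch of scanning contract text for clause keywords (the sample contract and party names below are my own illustration, not the actual NDA I tested):

```python
import re

def find_clauses(text, keywords):
    """Return paragraphs that mention any of the given clause keywords."""
    hits = {}
    for para in text.split("\n\n"):
        for kw in keywords:
            if re.search(kw, para, re.IGNORECASE):
                hits.setdefault(kw, []).append(para.strip())
    return hits

# Hypothetical contract snippet for demonstration only.
contract = """The parties agree to the following terms.

Indemnification: Acme Corp shall indemnify and hold harmless the Client.

Governing Law: This Agreement shall be governed by the laws of the State of Delaware."""

results = find_clauses(contract, ["indemnif", "governing law"])
print(results["governing law"][0])
```

A dozen lines of regex found in milliseconds what a million-token context window got wrong, which is the real indictment here.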
Gemini 3.1 Pro Outperforms Opus 4.6
While OpenAI and Anthropic have dominated the headlines, Google's new Gemini 3.1 Pro is posting impressive numbers: 77.1% on the ARC-AGI-2 abstract reasoning test. Opus 4.6's score isn't public, but its sibling model, Sonnet 4.6, scores 60.4%. In abstract reasoning, Google is well ahead.
The Price of Power: Is Opus 4.6 Worth the Cost?
Let's talk money. Opus smokes Sonnet on the GPQA Diamond benchmark, which tests graduate-level physics problem-solving. That's great if you're a PhD physicist, but that 17-point lead doesn't translate to coding, writing, or daily tasks, and most users will never notice the extra skill. On the benchmarks that do matter, like operating desktop apps (OSWorld), Sonnet is almost identical: 72.5% to Opus's 72.7%. Nobody will notice a 0.2-point difference.
Here's how I see it:
- Opus 4.6: Costs as much as a fancy dinner.
- Sonnet 4.6: Costs as much as a coffee.
- Gemini 3.1 Pro: Costs as much as a vending machine snack.
My final take on Opus 4.6...
The marketing for Opus 4.6 sets expectations higher than the product delivers. Agent Teams makes a cool demo, but it's expensive and buggy; I hit coordination failures and frequent disconnections. The massive context window didn't help the model answer basic questions about the NDA I uploaded. For the price, Opus 4.6 doesn't deliver. The real story is its cheaper sibling, Sonnet 4.6, which is nearly identical on the tasks that matter. Save your money and stick with Sonnet.