Skip to content

Claude Opus 4.6 vs. 4.7: The Upgrade That Isn’t Free

When Anthropic announced Claude Opus 4.7 on April 16, 2026, just 70 days after Opus 4.6 shipped, the headline seemed almost too good to be true.

The same $5/$25 per million token pricing. The same 1 million token context window. But better performance, sharper vision, and stronger coding abilities.

As it turns out, the fine print matters. Opus 4.7 is a genuinely better model for complex, agentic tasks. But Anthropic quietly introduced something developers are already calling “token inflation,” and it’s changing the cost calculus for anyone running these models in production.

Let’s break down what actually changed between Opus 4.6 and 4.7, and what it means for your budget and workflows.

The Quick Summary

FeaturesClaude Opus 4.6Claude Opus 4.7
Release DateFebruary 5, 2026April 16, 2026
Sticker Price$5 / $25 per MTok$5 / $25 per MTok
Context Window1M tokens1M tokens
Max Output128k tokens128k tokens
TokenizerPrevious versionUpdated (1.0–1.35× more tokens)
Vision Resolution1,568px / 1.15MP2,576px / 3.75MP
Effort Levelslow, medium, high, maxlow, medium, high, xhigh, max
temperature/top_p/top_kSupportedRemoved (returns 400 error)
Thinking ModeEnabled with budget tokensAdaptive only, off by default

What Got Better: The Real Improvements

1. Software Engineering Gains

Opus 4.7 shows substantial improvements on real-world coding benchmarks, making it Anthropic’s most capable model for autonomous development work.

BenchmarkOpus 4.6Opus 4.7Improvement
SWE-bench Verified80.8%87.6%+6.8 points
SWE-bench Pro53.4%64.3%+10.9 points
CursorBench58%70%+12 points
Rakuten-SWE-BenchBaseline3× more tasks resolved200% increase

The CursorBench jump from 58% to 70% is particularly meaningful, it measures a model’s ability to perform autonomous multi-file edits inside an IDE.

For teams building AI coding agents, this is the difference between a model that needs constant supervision and one that can actually ship work.

2. Vision: A Transformative Upgrade

Opus 4.7 can now accept images up to 2,576 pixels on the long edge (~3.75 megapixels), more than three times the resolution of Opus 4.6.

The real-world impact shows up in benchmark results. XBOW, which builds autonomous penetration testing tools, reported their visual acuity benchmark jumped from 54.5% on Opus 4.6 to 98.5% on Opus 4.7.

That’s not incremental, it’s the difference between a model that can’t reliably read dense UI screenshots and one that absolutely can.

3. Instruction Following: Literal vs. Loose

Where Opus 4.6 sometimes interpreted instructions loosely or skipped parts of complex requests, Opus 4.7 takes instructions literally and completely.

It also verifies its own outputs before reporting back, reducing those “I’ve implemented the change” replies that turn out to be wrong at review time.

Migration warning: If your prompts were tuned for Opus 4.6’s looser behavior, you’ll likely need to re-tune them. This model is more precise now, but it means existing workflows may produce unexpected results until adjusted.

4. New xhigh Effort Level

Opus 4.7 introduces a new effort level called xhigh, positioned between high and max. Claude Code’s default effort was raised to xhigh for all plans on release day.

For most coding and agentic tasks, Anthropic recommends starting with high or xhigh. max has diminishing returns and can lead to overthinking.

The Token Inflation Problem

Here’s where things get complicated.

What Changed

Opus 4.7 uses an updated tokenizer that processes text differently than Opus 4.6 did. The same input text now maps to 1.0 to 1.35× more tokens, depending on content type. For dense code or system prompts, the increase can be even higher.

Simon Willison ran the Opus 4.7 system prompt through both tokenizers and found the 4.7 version used 7,335 tokens vs. 5,039 on 4.6, a 1.46× multiplier.

What This Means for Your Bill

The sticker price hasn’t changed: $5 per million input tokens, $25 per million output. But your effective cost per task can rise significantly:

  • Text-heavy prompts: Up to 35% more expensive
  • Dense code prompts: Closer to 35–46% more expensive
  • High-resolution images: Up to 3× more tokens (though you can downsample to control costs)

User-compiled data from the Tokenomics tool shows the average token increase across real-world prompts is around 38.6%.

Output Tokens: The Double Hit

Output tokens are five times more expensive than input tokens ($25 vs. $5 per million). Opus 4.7 also “thinks more” before responding, especially at higher effort levels, generating more output tokens on top of the input token inflation.

Breaking Changes: What Stops Working

If you’re migrating from Opus 4.6 to 4.7, these changes will break existing code unless updated:

1. Extended Thinking Payloads

Opus 4.6 format:

python

thinking={"type": “enabled”, "budget_tokens": 10000}

Opus 4.7 format:

python

thinking={"type": “adaptive”, “effort”: “high”}

2. Sampling Parameters Removed

Setting temperature, top_p, or top_k to any non-default value now returns a 400 error. Remove these parameters entirely and use prompting to guide behavior instead.

3. Thinking Content Hidden by Default

Opus 4.7 still performs chain-of-thought reasoning, but the visible text is omitted unless you explicitly opt in:

python

thinking={"type": “adaptive”, “effort”: “high”, “display”: “summarized”}

Benchmark Comparison: Full Table

BenchmarkOpus 4.6Opus 4.7Notes
SWE-bench Verified80.8%87.6%+6.8 points
SWE-bench Pro53.4%64.3%+10.9 points
Terminal-Bench 2.065.4%69.4%+4 points
CursorBench58%70%+12 points
MCP-Atlas (tool use)75.8%77.3%+1.5 points
OSWorld-Verified72.7%78.0%+5.3 points
Finance Agent v1.160.1%64.4%+4.3 points
GPQA Diamond91.3%94.2%+2.9 points
CharXiv Reasoning (vision)84.7%91.0%+6.3 points

Sources: Anthropic, Cursor, Rakuten, Harvey, Databricks

System Prompt Changes

Anthropic publishes their Claude.ai system prompts, and the diff between 4.6 and 4.7 reveals some interesting shifts:

Added:

  • Claude in Chrome (browsing agent), Claude in Excel, Claude in PowerPoint
  • Expanded child safety section with critical instruction tags
  • Tool search mechanism: models now call tool search before claiming they lack a capability
  • Guidance to be less verbose and more concise

Removed:

  • The explicit note that “Donald Trump is the current president” (the 4.7 model’s knowledge cut-off is January 2026, making this unnecessary)
  • Instructions to avoid saying “genuinely,” “honestly,” or “straightforward”
  • The section about avoiding emotes or asterisk actions

Migration Strategy: How to Move to 4.7

Step 1: Measure Your Actual Token Inflation

Don’t rely on the 1.0–1.35× range. Run a representative sample of your actual production prompts through both tokenizers to calculate your real multiplier.

Step 2: Update API Calls

  • Remove temperature, top_p, and top_k parameters
  • Update thinking payloads to the new adaptive format
  • Explicitly enable thinking display if your product shows reasoning traces

Step 3: Re-tune Your Prompts

Opus 4.7 takes instructions literally. If your prompts relied on the model “filling in the blanks,” add more explicit guidance.

Step 4: Start with Staged Rollout

Swap a small percentage of coding traffic to claude-opus-4-7, re-run your eval suite, measure token deltas alongside quality metrics, then promote gradually.

Step 5: Consider Keeping Opus 4.6 as a Fallback

Given the breaking API changes, decouple your application logic from specific model versions so you can switch between 4.6 and 4.7 with a single parameter change.

Which Model Should You Use?

Use Opus 4.7 if:

  • You’re building autonomous coding agents (the SWE-bench gains are real).
  • You need high-resolution image understanding (UI screenshots, diagrams, dense dashboards).
  • Your prompts are already well-structured and you want more literal instruction following.
  • You can absorb a 20–40% effective cost increase for better quality.

Stick with Opus 4.6 if:

  • You have tight token budgets that can’t accommodate 30%+ inflation.
  • Your prompts rely on loose interpretation (you haven’t re-tuned for 4.7’s literalness).
  • You need temperature or other sampling parameters.
  • Your use case doesn’t need the vision or coding improvements (e.g., simple chat, basic document Q&A).

The Bottom Line

Opus 4.7 is a genuinely better model, especially for software engineering, vision tasks, and long-running agentic workflows. The 87.6% on SWE-bench Verified and the 3× vision resolution upgrade are meaningful, not marketing hype.

But “same price” is misleading. Between the tokenizer inflation (up to 35% more input tokens) and the model’s tendency to “think more” (more output tokens), your effective cost per task could rise 30–50% in practice.

Anthropic has delivered better capability at the same per-token price, but completing any given task now requires more tokens. Your decision hinges on whether the quality gains justify the higher per-task expense.

For teams building production coding agents, the answer is likely yes, the 12-point gain on CursorBench and 3× production task resolution on Rakuten-SWE-Bench justify the cost. For simpler workloads or teams on tight budgets, Opus 4.6 remains a perfectly capable option.

Kevin James

Kevin James

I'm Kevin James, and I'm passionate about writing on Security and cybersecurity topics. Here, I'd like to share a bit more about myself.I hold a Bachelor of Science in Cybersecurity from Utica College, New York, which has been the foundation of my career in cybersecurity.As a writer, I have the privilege of sharing my insights and knowledge on a wide range of cybersecurity topics. You'll find my articles here at Cybersecurityforme.com, covering the latest trends, threats, and solutions in the field.