When Anthropic announced Claude Opus 4.7 on April 16, 2026, just 70 days after Opus 4.6 shipped, the headline seemed almost too good to be true.
The same $5/$25 per million token pricing. The same 1 million token context window. But better performance, sharper vision, and stronger coding abilities.
As it turns out, the fine print matters. Opus 4.7 is a genuinely better model for complex, agentic tasks. But Anthropic quietly introduced something developers are already calling “token inflation,” and it’s changing the cost calculus for anyone running these models in production.
Let’s break down what actually changed between Opus 4.6 and 4.7, and what it means for your budget and workflows.
The Quick Summary
| Features | Claude Opus 4.6 | Claude Opus 4.7 |
| Release Date | February 5, 2026 | April 16, 2026 |
| Sticker Price | $5 / $25 per MTok | $5 / $25 per MTok |
| Context Window | 1M tokens | 1M tokens |
| Max Output | 128k tokens | 128k tokens |
| Tokenizer | Previous version | Updated (1.0–1.35× more tokens) |
| Vision Resolution | 1,568px / 1.15MP | 2,576px / 3.75MP |
| Effort Levels | low, medium, high, max | low, medium, high, xhigh, max |
| temperature/top_p/top_k | Supported | Removed (returns 400 error) |
| Thinking Mode | Enabled with budget tokens | Adaptive only, off by default |
What Got Better: The Real Improvements
1. Software Engineering Gains
Opus 4.7 shows substantial improvements on real-world coding benchmarks, making it Anthropic’s most capable model for autonomous development work.
| Benchmark | Opus 4.6 | Opus 4.7 | Improvement |
| SWE-bench Verified | 80.8% | 87.6% | +6.8 points |
| SWE-bench Pro | 53.4% | 64.3% | +10.9 points |
| CursorBench | 58% | 70% | +12 points |
| Rakuten-SWE-Bench | Baseline | 3× more tasks resolved | 200% increase |
The CursorBench jump from 58% to 70% is particularly meaningful, it measures a model’s ability to perform autonomous multi-file edits inside an IDE.
For teams building AI coding agents, this is the difference between a model that needs constant supervision and one that can actually ship work.
2. Vision: A Transformative Upgrade
Opus 4.7 can now accept images up to 2,576 pixels on the long edge (~3.75 megapixels), more than three times the resolution of Opus 4.6.
The real-world impact shows up in benchmark results. XBOW, which builds autonomous penetration testing tools, reported their visual acuity benchmark jumped from 54.5% on Opus 4.6 to 98.5% on Opus 4.7.
That’s not incremental, it’s the difference between a model that can’t reliably read dense UI screenshots and one that absolutely can.
3. Instruction Following: Literal vs. Loose
Where Opus 4.6 sometimes interpreted instructions loosely or skipped parts of complex requests, Opus 4.7 takes instructions literally and completely.
It also verifies its own outputs before reporting back, reducing those “I’ve implemented the change” replies that turn out to be wrong at review time.
Migration warning: If your prompts were tuned for Opus 4.6’s looser behavior, you’ll likely need to re-tune them. This model is more precise now, but it means existing workflows may produce unexpected results until adjusted.
4. New xhigh Effort Level
Opus 4.7 introduces a new effort level called xhigh, positioned between high and max. Claude Code’s default effort was raised to xhigh for all plans on release day.
For most coding and agentic tasks, Anthropic recommends starting with high or xhigh. max has diminishing returns and can lead to overthinking.
The Token Inflation Problem
Here’s where things get complicated.
What Changed
Opus 4.7 uses an updated tokenizer that processes text differently than Opus 4.6 did. The same input text now maps to 1.0 to 1.35× more tokens, depending on content type. For dense code or system prompts, the increase can be even higher.
Simon Willison ran the Opus 4.7 system prompt through both tokenizers and found the 4.7 version used 7,335 tokens vs. 5,039 on 4.6, a 1.46× multiplier.
What This Means for Your Bill
The sticker price hasn’t changed: $5 per million input tokens, $25 per million output. But your effective cost per task can rise significantly:
- Text-heavy prompts: Up to 35% more expensive
- Dense code prompts: Closer to 35–46% more expensive
- High-resolution images: Up to 3× more tokens (though you can downsample to control costs)
User-compiled data from the Tokenomics tool shows the average token increase across real-world prompts is around 38.6%.
Output Tokens: The Double Hit
Output tokens are five times more expensive than input tokens ($25 vs. $5 per million). Opus 4.7 also “thinks more” before responding, especially at higher effort levels, generating more output tokens on top of the input token inflation.
Breaking Changes: What Stops Working
If you’re migrating from Opus 4.6 to 4.7, these changes will break existing code unless updated:
1. Extended Thinking Payloads
Opus 4.6 format:
python
thinking={"type": “enabled”, "budget_tokens": 10000}
Opus 4.7 format:
python
thinking={"type": “adaptive”, “effort”: “high”}
2. Sampling Parameters Removed
Setting temperature, top_p, or top_k to any non-default value now returns a 400 error. Remove these parameters entirely and use prompting to guide behavior instead.
3. Thinking Content Hidden by Default
Opus 4.7 still performs chain-of-thought reasoning, but the visible text is omitted unless you explicitly opt in:
python
thinking={"type": “adaptive”, “effort”: “high”, “display”: “summarized”}
Benchmark Comparison: Full Table
| Benchmark | Opus 4.6 | Opus 4.7 | Notes |
| SWE-bench Verified | 80.8% | 87.6% | +6.8 points |
| SWE-bench Pro | 53.4% | 64.3% | +10.9 points |
| Terminal-Bench 2.0 | 65.4% | 69.4% | +4 points |
| CursorBench | 58% | 70% | +12 points |
| MCP-Atlas (tool use) | 75.8% | 77.3% | +1.5 points |
| OSWorld-Verified | 72.7% | 78.0% | +5.3 points |
| Finance Agent v1.1 | 60.1% | 64.4% | +4.3 points |
| GPQA Diamond | 91.3% | 94.2% | +2.9 points |
| CharXiv Reasoning (vision) | 84.7% | 91.0% | +6.3 points |
Sources: Anthropic, Cursor, Rakuten, Harvey, Databricks
System Prompt Changes
Anthropic publishes their Claude.ai system prompts, and the diff between 4.6 and 4.7 reveals some interesting shifts:
Added:
- Claude in Chrome (browsing agent), Claude in Excel, Claude in PowerPoint
- Expanded child safety section with critical instruction tags
- Tool search mechanism: models now call
tool searchbefore claiming they lack a capability - Guidance to be less verbose and more concise
Removed:
- The explicit note that “Donald Trump is the current president” (the 4.7 model’s knowledge cut-off is January 2026, making this unnecessary)
- Instructions to avoid saying “genuinely,” “honestly,” or “straightforward”
- The section about avoiding emotes or asterisk actions
Migration Strategy: How to Move to 4.7
Step 1: Measure Your Actual Token Inflation
Don’t rely on the 1.0–1.35× range. Run a representative sample of your actual production prompts through both tokenizers to calculate your real multiplier.
Step 2: Update API Calls
- Remove
temperature,top_p, andtop_kparameters - Update thinking payloads to the new adaptive format
- Explicitly enable thinking display if your product shows reasoning traces
Step 3: Re-tune Your Prompts
Opus 4.7 takes instructions literally. If your prompts relied on the model “filling in the blanks,” add more explicit guidance.
Step 4: Start with Staged Rollout
Swap a small percentage of coding traffic to claude-opus-4-7, re-run your eval suite, measure token deltas alongside quality metrics, then promote gradually.
Step 5: Consider Keeping Opus 4.6 as a Fallback
Given the breaking API changes, decouple your application logic from specific model versions so you can switch between 4.6 and 4.7 with a single parameter change.
Which Model Should You Use?
Use Opus 4.7 if:
- You’re building autonomous coding agents (the SWE-bench gains are real).
- You need high-resolution image understanding (UI screenshots, diagrams, dense dashboards).
- Your prompts are already well-structured and you want more literal instruction following.
- You can absorb a 20–40% effective cost increase for better quality.
Stick with Opus 4.6 if:
- You have tight token budgets that can’t accommodate 30%+ inflation.
- Your prompts rely on loose interpretation (you haven’t re-tuned for 4.7’s literalness).
- You need
temperatureor other sampling parameters. - Your use case doesn’t need the vision or coding improvements (e.g., simple chat, basic document Q&A).
The Bottom Line
Opus 4.7 is a genuinely better model, especially for software engineering, vision tasks, and long-running agentic workflows. The 87.6% on SWE-bench Verified and the 3× vision resolution upgrade are meaningful, not marketing hype.
But “same price” is misleading. Between the tokenizer inflation (up to 35% more input tokens) and the model’s tendency to “think more” (more output tokens), your effective cost per task could rise 30–50% in practice.
Anthropic has delivered better capability at the same per-token price, but completing any given task now requires more tokens. Your decision hinges on whether the quality gains justify the higher per-task expense.
For teams building production coding agents, the answer is likely yes, the 12-point gain on CursorBench and 3× production task resolution on Rakuten-SWE-Bench justify the cost. For simpler workloads or teams on tight budgets, Opus 4.6 remains a perfectly capable option.
