Mitul Jagad

Senior Full Stack Developer. Six years shipping production for SaaS and FinTech teams. Lately building AI agents and workflow automations that have to actually run, not just demo.

AI Engineering

Opus 4.7: The Hidden Math

Apr 2026 · 6 min · First posted on LinkedIn

Three days in, the benchmarks say SOTA. The migration docs and my own usage data tell a different story. The tokenizer shift, the MRCR drop, and a practical framework for when 4.7 is actually worth the bill.

Anthropic shipped Opus 4.7 and the headline benchmarks all moved up. Before you swap your default model, the interesting story is in the migration notes and the bill.

The tokenizer changed

4.7 ships a revised tokenizer. For English-heavy prompts it's a wash; for code, structured output, and non-English text the token counts drift noticeably from 4.6. That matters for two reasons: your cost projections (priced per token) and your context budgets (sized in tokens). A 30k-char file that fit comfortably in a 4.6 context window may push you closer to the edge in 4.7. Re-measure before you assume.
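One way to re-measure: feed the same prompt set through both models' token counters and quantify the drift. A minimal sketch, assuming you've already collected per-prompt counts (e.g. via your provider's token-counting endpoint) - the numbers below are illustrative, not real measurements:

```python
# Hypothetical sketch: quantify tokenizer drift between model versions
# from token counts you've measured yourself for the same prompts.

def tokenizer_drift(counts_old: list[int], counts_new: list[int]) -> float:
    """Mean relative change in token count, old tokenizer -> new."""
    assert len(counts_old) == len(counts_new)
    changes = [(b - a) / a for a, b in zip(counts_old, counts_new)]
    return sum(changes) / len(changes)

def fits_context(tokens: int, window: int, reserve_for_output: int = 4096) -> bool:
    """Does the prompt still fit once you reserve room for the reply?"""
    return tokens + reserve_for_output <= window

# Illustrative only: code-heavy prompts drifting upward on a new tokenizer
counts_46 = [7_800, 28_500, 31_000]
counts_47 = [8_200, 30_100, 32_900]
print(f"mean drift: {tokenizer_drift(counts_46, counts_47):+.1%}")
```

A few percent of drift is noise on a short prompt and a real problem on a file that already sits near the window edge.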

MRCR is not a free upgrade

Multi-Round Conversational Recall (MRCR) measures how reliably the model can pull facts from earlier in a long conversation. 4.7's published numbers on MRCR aren't strictly better than 4.6 across the board - for some long-context workloads, 4.6 still edges it. If your agent depends on stable recall of facts dropped 50k tokens ago, A/B test before you migrate the whole system.
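The A/B test doesn't need to be fancy. A sketch of a recall harness, where `call_model` stands in for your actual client call and each case plants a fact early in a long transcript:

```python
# Sketch of an A/B recall check before migrating. `call_model` is a
# placeholder for your real client; swap in your own eval cases.

from dataclasses import dataclass
from typing import Callable

@dataclass
class RecallCase:
    transcript: list[dict]   # long multi-turn history with a planted fact
    question: str            # asked at the end of the conversation
    expected: str            # substring the answer must contain

def recall_pass_rate(call_model: Callable[[str, list[dict], str], str],
                     model: str, cases: list[RecallCase]) -> float:
    hits = 0
    for case in cases:
        answer = call_model(model, case.transcript, case.question)
        hits += case.expected.lower() in answer.lower()
    return hits / len(cases)

# Run the same cases against both models, compare, then decide:
# rate_46 = recall_pass_rate(call_model, "opus-4.6", cases)
# rate_47 = recall_pass_rate(call_model, "opus-4.7", cases)
```

If the 4.7 pass rate drops on your own transcripts, the leaderboard delta is irrelevant.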

The cost picture from real usage

Pulling my own daily reports: a typical day mixing Haiku and Opus runs around $40-$80. Days heavy on Opus plus agentic tool use can blow past $200. Cache reads are doing the bulk of the work - they account for the vast majority of token volume and most of the cost. A model swap that changes how the cache hits (different tokenizer → different cache keys) can transiently spike your bill until the cache repopulates.
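The cache-miss spike is easy to model. A back-of-the-envelope sketch - the per-million-token prices and the token volumes here are placeholders, not real rates; plug in your provider's actual pricing:

```python
# Illustrative daily-cost arithmetic. The point: cache reads dominate
# volume, so a tokenizer change that invalidates cache keys briefly
# moves those reads back to full-price input until the cache refills.

PRICES = {  # $/1M tokens -- placeholder values, not real pricing
    "input": 15.00,
    "cache_read": 1.50,   # cached input is typically ~10x cheaper
    "output": 75.00,
}

def daily_cost(tokens: dict[str, int]) -> float:
    return sum(tokens[k] / 1_000_000 * PRICES[k] for k in tokens)

normal_day = {"input": 1_000_000, "cache_read": 30_000_000, "output": 150_000}
# After a swap, cache keys miss and reads are billed as fresh input:
swap_day = {"input": 31_000_000, "cache_read": 0, "output": 150_000}

print(f"normal: ${daily_cost(normal_day):.2f}")
print(f"swap:   ${daily_cost(swap_day):.2f}")
```

Same workload, several times the bill for a day or two. Budget for it before you flip the default.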

When 4.7 is actually worth it

  • Hard reasoning tasks where 4.6 was already at its limit - code refactors that touch many files, multi-step planning, complex SQL generation.
  • Workloads where small accuracy improvements compound - agents that loop (a 2% reliability bump per step compounds across 20 steps).
  • Net-new context where you're already paying full input cost - the tokenizer change has the smallest blast radius here.
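The compounding claim in the second bullet is worth seeing in numbers - per-step reliability multiplies across an agent loop:

```python
# A small per-step reliability bump compounds across a looping agent.

def loop_success(per_step: float, steps: int) -> float:
    """Probability the whole loop succeeds if every step must succeed."""
    return per_step ** steps

steps = 20
print(f"95% per step over {steps} steps: {loop_success(0.95, steps):.1%}")  # ~35.8%
print(f"97% per step over {steps} steps: {loop_success(0.97, steps):.1%}")  # ~54.4%
```

Two points of per-step accuracy turn a coin flip you usually lose into one you usually win.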

When to stay on 4.6

  • Long-context workloads that depend heavily on recall from earlier turns.
  • Production agents whose cost ceilings are tight and whose cache patterns are stable.
  • Anything where you have measured reliability and can't afford a quiet regression.

The honest take

Don't read the leaderboard and switch. Read the migration notes. Run the same task on both for a week. Watch your cost graph and your eval pass-rate. Then move (or don't) with data, not with the press release.