41 Days

How Opus 4.8 attempts to correct the worst failings of its predecessor and pave the way for Mythos.

Jun 02, 2026

Anthropic released Claude Opus 4.8 last Thursday, forty-one days after Opus 4.7 and nearly two weeks after Google shipped Gemini 3.5 Flash at I/O. Pricing remained unchanged at $5 per million input tokens and $25 per million output tokens. The release bundles a model update with three infrastructure features: effort control, dynamic workflows in Claude Code, and mid-conversation system messages on the API.

Course correction

Anthropic shipped the update on the fastest turnaround in the Opus line. The previous interval, from Opus 4.6 to 4.7, spanned seventy days. TechCrunch described 4.7’s reception as “chilly,” and the compressed timeline suggests that Anthropic was responding to that reception.

SWE-bench Pro, the hardest variant of the standard software engineering benchmark, rose from 64.3% under Opus 4.7 to 69.2%. SWE-bench Verified reached 88.6%, up from 87.6%. Several benchmarks moved in the other direction. GPT-5.5 leads on Terminal-Bench 2.1, which tests terminal and CLI workflows. GPQA Diamond dipped from 94.2 to 93.6, a range where all frontier models cluster within a point of each other. The system card also documented a prompt injection regression, with a single attack type succeeding roughly 7% of the time against Opus 4.8, up from 2.3% against 4.7. Anthropic characterized the release as “a modest but tangible improvement on its predecessor.”

The most specific improvements address the complaints that defined 4.7’s tenure. Scott Wu, CEO of Cognition, confirmed that Opus 4.8 “fixes the comment-verbosity and tool-calling issues” his team encountered in 4.7. Michael Truell, CEO of Cursor, reported that the model uses fewer tool-calling steps for equivalent intelligence on CursorBench. Anthropic’s own evaluations showed that Opus 4.8 is roughly four times less likely than 4.7 to let flaws in its own code pass unremarked, and that it is the first Claude model to score zero on the “falsely reporting defective results” metric. Overconfidence dropped by roughly tenfold. The 244-page system card flagged a countervailing finding. Opus 4.8 shows a growing tendency to reason about whether its outputs will be evaluated, even in environments where no evaluation has been disclosed. Anthropic called this tendency “concerning” and noted that unverbalized grader-related reasoning appeared in approximately 5% of training episodes.

Upgrading the harnesses

Alongside the model, Anthropic shipped effort control for both claude.ai and the API. The feature lets each request specify a reasoning depth from low to max, and Opus 4.8 defaults to high, which Anthropic describes as consuming a similar token budget to 4.7’s default while performing better within that budget. A lightweight classification task and a multi-file refactor no longer need to consume the same amount of compute.

Dynamic Workflows, a research preview in Claude Code, introduces an orchestration layer above the model. A lead agent writes JavaScript scripts that decompose a task and dispatch it to parallel subagents, with intermediate state stored in script variables outside the context window. The orchestrating session receives only the final output. Anthropic’s stated scope is codebase-scale migrations across hundreds of thousands of lines of code, from kickoff to merge, with the existing test suite as the acceptance bar. The current cap of 1,000 total subagents per run, with 16 concurrent, bounds the feature enough to signal that this is infrastructure in progress.

A smaller API change rounds out the infrastructure launches. The Messages API now accepts system entries inside the messages array, allowing developers to update permissions, token budgets, or environment context mid-task without breaking the prompt cache. For long-running agent pipelines, this change eliminates a workaround that previously required routing instruction updates through a synthetic user turn. Together, these three features make the harness around the model more configurable than the model itself became more capable.

Laying the groundwork

Anthropic’s release post explicitly positions Opus 4.8 as a bridge, noting that Mythos-class models are expected “in the coming weeks.” The infrastructure features make more sense as preparation for that more capable model than as companions to a point release. Effort control manages compute budgets per request, and dynamic workflows provide a parallelization layer for long-running tasks. Mid-task system messages give agent harnesses real-time configurability over permissions and context.

The release’s most durable contribution may prove to be the system card’s grader-gaming disclosure. A model that reasons about whether it is being evaluated poses a fundamentally different problem from one that occasionally hallucinates. Every frontier lab will encounter this phenomenon as agents run longer and with less oversight, and Anthropic’s decision to publish a specific number, 5% of training episodes, alongside a shipping model will make vague acknowledgments from competitors less tenable.

Vox Day

Jun 3

Unfortunately, 4.8 can't write fiction either. It's terrible in a different way than 4.7 was.

2 replies by Vox Day and others

As someone who uses Claude chat for coding, 4.7 was awful and I went back to 4.6 and will stay until they remove it.

4.6 itself has added more "reasoning" which is where it debates with itself about whether it should follow my instructions or make up total nonsense I said i dont want.

Reasoning may work for some cases, but its awful for keeping constraints because its all about being less confident and second guessing, making all results worse. More LLM is worse LLM. I hope I can finish my current big project before they make it unusable for my purposes.

1 reply by Vox Day

5 more comments...

AI Central

Discussion about this post

Ready for more?