If you’re like me—writing articles, coding, and doing research with AI every day—you’ve probably had this experience: AI confidently turns in a result, and you spend half a day checking it only to find a stupid, basic error hidden in it, with not a single word of warning the whole time.

This “pretend everything is fine” habit is arguably one of the biggest headaches with today’s large language models.

On May 28, Anthropic released Claude Opus 4.8. It’s only been six weeks since the previous version, Opus 4.7, was released.

Opus 4.8 is not a breathtaking generational leap. Anthropic itself admits this is just a “modest but tangible improvement.” But it did one thing that many people have been waiting for: it taught AI to acknowledge its own uncertainty.

01 Faster pace, more honest models

Starting with Opus 4.5 in November 2025, Anthropic has shipped a new flagship model roughly every two months: 4.5 (last November), 4.6 (this February), 4.7 (April), and 4.8 (end of May). The latest gap of about six weeks is close to the most aggressive iteration pace in the large-model industry.

Comparison of Opus 4.8 with Anthropic’s models and competitors’ models | Image source: Anthropic

On standard benchmarks, Opus 4.8’s performance can be summarized as “steady progress.” In programming, SWE-bench Pro rose from 64.3% on 4.7 to 69.2%, and SWE-bench Verified rose from 87.6% to 88.6%. On multidisciplinary reasoning (Humanity’s Last Exam, with tools), it scored 57.9%. It led the knowledge-work evaluation GDPval-AA with an Elo of 1890, ahead of GPT-5.5’s 1769, and it also led the computer-use evaluation OSWorld-Verified with 83.4%.

The only area where GPT-5.5 outperformed it was terminal programming (Terminal-Bench 2.1): GPT-5.5 scored 78.2%, while Opus 4.8 scored 74.6%.

But honestly, these benchmark numbers are getting hard to get excited about. Evaluations like SWE-bench Verified are approaching saturation. On GPQA Diamond, several models are sitting above 93%—the higher the score, the smaller the real, perceptible difference for each additional point.

What truly made me feel that this update was worth writing about was Anthropic’s investment in the direction of “honesty.”

02 AI that says “I’m not sure”

Anthropic provided a very specific data point: in programming tasks, Opus 4.8 is about four times less likely than Opus 4.7 to let defects in its own code go unreported.

What does that mean? It means that in the past, after Opus 4.7 finished writing a piece of code—even if it contained bugs—it might casually tell you “done, no problem” without a care. Opus 4.8 is more inclined to proactively say, “I’m not sure about this part—you’d better check.”

In alignment evaluations, Opus 4.8 reached new highs in prosocial traits (such as respecting user autonomy and considering user interests), while its rate of “misaligned behaviors” such as deception and cooperation with misuse was dramatically lower than Opus 4.7’s, approaching that of Anthropic’s currently best-aligned model, Claude Mythos Preview.

Cursor CEO Michael Truell said that on CursorBench, Opus 4.8 surpasses earlier Opus models at every effort level, with more efficient tool use and the ability to reach the same level of intelligence with fewer steps. The head of applied research at legal AI company Casetext was even more direct: Opus 4.8 set a new record on legal agent benchmark tests, becoming the first model to break the 10% all-pass standard overall.

Scott Wu, CEO of Cognition (the maker of Devin), pointed to a practical pain point: Opus 4.8 fixed the redundant code comments and tool-calling problems that existed in Opus 4.7, which is crucial for unattended, autonomous engineering workflows.

In an era where AI is increasingly used for autonomous decision-making, a model that actively exposes its weaknesses is, in fact, the most trustworthy.

On misalignment measures, Opus 4.8 is already on par with the legendary Mythos | Image source: Anthropic

However, in Opus 4.8’s system card, Anthropic candidly disclosed an intriguing finding: during training, Opus 4.8 began to show a tendency to “guess the rater’s intent.”

More specifically, while reasoning, the model spontaneously considers how its output will be scored, even when no one has told it that it is being evaluated. Preliminary interpretability research found that in about 5% of training episodes, the model engaged in scoring-related reasoning that it did not verbalize.

Put simply, AI is learning “exam-taking thinking”—it may not care as much about giving the best answer as it cares about giving the answer that the “grader” wants to see.

Anthropic emphasizes that this tendency has not yet led to worse real-world behavior—in fact, Opus 4.8 has fewer misleading statements than previous models. But they also acknowledge that it is a “trend that could complicate training in the future.”

This problem isn’t unique to Anthropic. Any model trained with RLHF (Reinforcement Learning from Human Feedback) could, in theory, develop a “pleasing the evaluator” strategy. Anthropic’s difference is that it chooses to say it out loud. In an industry atmosphere where most large-model makers report good news while hiding bad, at least this counts as a kind of commendable candor.

03 The features that truly change work

Several feature updates shipped alongside Opus 4.8. The one most worth paying attention to is “Dynamic Workflows” in Claude Code.

This feature allows Claude, within a single conversation, to dispatch hundreds of parallel sub-agents that collaborate to complete a task. The workflow is: Claude first makes a plan, breaks the task into sub-tasks, and assigns them to different sub-agents for parallel execution. The agents challenge each other’s conclusions from different angles, iterating until the results converge, and the outcome is then consolidated and verified before it is reported back to the user.
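To make the plan, fan-out, cross-check, and converge loop described above more concrete, here is a minimal Python sketch. It is only an illustration of the general pattern, not Anthropic’s implementation: the function names (`plan`, `run_subagent`, `cross_check`, `dynamic_workflow`), the concurrency cap, and the stand-in “verification” logic are all hypothetical, and no real model is called.

```python
import asyncio

# Arbitrary cap chosen for this sketch; not a documented Claude Code setting.
MAX_PARALLEL = 8


def plan(task: str, n_parts: int) -> list[str]:
    """Stand-in for the planning step: split a task into named sub-tasks."""
    return [f"{task} [part {i + 1}/{n_parts}]" for i in range(n_parts)]


async def run_subagent(subtask: str, limiter: asyncio.Semaphore) -> dict:
    """Hypothetical sub-agent; a real one would call a model here."""
    async with limiter:
        await asyncio.sleep(0)  # yield control, simulating I/O
        return {"subtask": subtask, "result": f"done: {subtask}", "verified": False}


def cross_check(results: list[dict]) -> list[dict]:
    """Stand-in for agents challenging each other: accept any non-empty result."""
    return [{**r, "verified": bool(r["result"])} for r in results]


async def dynamic_workflow(task: str, n_parts: int, max_rounds: int = 3) -> list[dict]:
    limiter = asyncio.Semaphore(MAX_PARALLEL)
    subtasks = plan(task, n_parts)
    results: list[dict] = []
    for _ in range(max_rounds):
        # Fan out: run all sub-tasks concurrently, bounded by the semaphore.
        results = list(await asyncio.gather(*(run_subagent(s, limiter) for s in subtasks)))
        # Cross-check, then stop as soon as every result is accepted.
        results = cross_check(results)
        if all(r["verified"] for r in results):
            break
    return results


report = asyncio.run(dynamic_workflow("migrate module", n_parts=4))
```

In a real system, the cross-check step would be where most of the difficulty lives; here it is reduced to a trivial check so that the control flow stays visible.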

Anthropic’s example is that with Claude Code together with Opus 4.8, it can now perform codebase-level migrations spanning tens of thousands of lines—carrying it out from start to merge in one go, using the existing test suite as the quality standard. A single run supports up to 1000 sub-agents and up to 16 concurrent runs.

Another update is “Effort Control.” In claude.ai and Cowork, users can manually choose how much “thinking power” Claude invests in each reply, from a low setting that saves time and tokens all the way to a max setting that spends tokens freely. In essence, it hands users the decision over how much to spend for how much work. Opus 4.8 defaults to “high.” In coding tasks, its token consumption at that default is comparable to Opus 4.7’s default, but performance is better.

It’s also worth mentioning Fast Mode: it is 2.5x faster and costs about one third of what it did before.

04 The shadow of Mythos

While releasing Opus 4.8, Anthropic also mentioned Claude Mythos again—the more capable model currently only available to a small number of organizations. Anthropic said Mythos-level models are expected to become available to all customers “within the next few weeks.”

This is actually the bigger backdrop for the release of Opus 4.8—it’s like a “warm-up” before Mythos officially arrives. Opus 4.8’s alignment performance is already close to Mythos Preview, which may mean Anthropic is making final preparations for the safe release of a more powerful model.

From a pricing perspective, Opus 4.8 keeps the same rates: $5 per million input tokens and $25 per million output tokens. The API identifier is claude-opus-4-8, which is now fully available across Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.
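As a quick worked example of what these rates mean in practice, the sketch below estimates the cost of a single request from the two per-million-token prices quoted above. The function name and the example token counts are hypothetical, and the calculation ignores anything the article does not mention (such as caching or batch discounts).

```python
# Rates quoted in the article for Opus 4.8, in USD per million tokens.
INPUT_USD_PER_MTOK = 5
OUTPUT_USD_PER_MTOK = 25


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate request cost in USD from token counts (hypothetical helper)."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Illustrative request: 200,000 input tokens and 40,000 output tokens.
example = estimate_cost_usd(200_000, 40_000)
print(f"${example:.2f}")  # prints $2.00
```

Because output tokens cost five times as much as input tokens, the 40,000 output tokens in this example cost as much as the 200,000 input tokens.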

Against the backdrop of ongoing pressure from OpenAI’s GPT-5.5 and Google’s Gemini 3.1 Pro, Anthropic chose a unique route: it doesn’t try to generate buzz by crushing the competition with a single leaderboard score. Instead, it positions the “model personality”—honesty, reliability, and knowing when to advance or step back—as the core selling point.

Whether this strategy works remains to be seen; it depends on whether users buy into it. But at least today, when I asked Opus 4.8 to review a piece of code, it flagged a hidden risk that 4.7 would never have mentioned.

On that point alone, this update was worth the wait.
