Futures
Access hundreds of perpetual contracts
CFD
Gold
One platform for global traditional assets
Options
Hot
Trade European-style vanilla options
Unified Account
Maximize your capital efficiency
Demo Trading
Introduction to Futures Trading
Learn the basics of futures trading
Futures Events
Join events to earn rewards
Demo Trading
Use virtual funds to practice risk-free trading
Launch
CandyDrop
Collect candies to earn airdrops
Launchpool
Quick staking, earn potential new tokens
HODLer Airdrop
Hold GT and get massive airdrops for free
Pre-IPOs
Unlock full access to global stock IPOs
Alpha Points
Trade on-chain assets and earn airdrops
Futures Points
Earn futures points and claim airdrop rewards
Promotions
AI
Gate AI
Your all-in-one conversational AI partner
Gate AI Bot
Use Gate AI directly in your social App
GateClaw
Gate Blue Lobster, ready to go
Gate for AI Agent
AI infrastructure, Gate MCP, Skills, and CLI
Gate Skills Hub
10K+ Skills
From office tasks to trading, the all-in-one skill hub makes AI even more useful.
GateRouter
Smartly choose from 40+ AI models, with 0% extra fees
Opus 4.8 Officially Released, AI Says "I'm Not Sure" for the First Time
Author | Hualín Wǔwáng
Editor | Jìngyǔ
If you’re like me—writing articles, coding, and doing research with AI every day—you’ve probably had this experience: AI confidently turns in a result, and you spend half a day checking it only to find a stupid, basic error hidden in it, with not a single word of warning the whole time.
This “pretend everything is fine” habit is likely one of the biggest headaches for today’s large language models.
On May 28, Anthropic released Claude Opus 4.8. It’s only been six weeks since the previous version, Opus 4.7, was released.
Opus 4.8 is not a suffocating generational leap. Anthropic itself admits this is just a “modest but tangible improvement”—but it did one thing that many people have been waiting for: teaching AI to acknowledge its own uncertainty.
01 Faster pace, more honest models
Starting with Opus 4.5 in November 2025, Anthropic’s flagship model iteration cycle has shifted to roughly once every two months—4.5 (last November), 4.6 (this February), 4.7 (April), and 4.8 (end of May). A release cadence of every six weeks is nearly the most aggressive iteration speed in the large-model industry.
Comparison of Opus 4.8 with Anthropic’s models and competitors’ models | Image source: Anthropic
On standard benchmarks, Opus 4.8’s performance can be summarized as “steady progress.” In programming ability, SWE-bench Pro rose from 64.3% on 4.7 to 69.2%, and SWE-bench Verified increased from 87.6% to 88.6%. Multidisciplinary reasoning (Humanity’s Last Exam), using tools, scored 57.9%. The knowledge-work evaluation GDPval-AA led with an Elo of 1890, surpassing GPT-5.5’s 1769. The computer operation evaluation OSWorld-Verified also led with 83.4%.
The only area where GPT-5.5 outperformed it was terminal programming (Terminal-Bench 2.1): GPT-5.5 scored 78.2%, while Opus 4.8 scored 74.6%.
But honestly, these benchmark numbers are getting hard to get excited about. Evaluations like SWE-bench Verified are approaching saturation. On GPQA Diamond, several models are sitting above 93%—the higher the score, the smaller the real, perceptible difference for each additional point.
What truly made me feel that this update was worth writing about was Anthropic’s investment in the direction of “honesty.”
02 AI that says “I’m not sure”
Anthropic provided a very specific data point: in programming tasks, the probability that Opus 4.8 fails to report its own code defects is about four times lower than Opus 4.7.
What does that mean? It means that in the past, after Opus 4.7 finished writing a piece of code—even if it contained bugs—it might casually tell you “done, no problem” without a care. Opus 4.8 is more inclined to proactively say, “I’m not sure about this part—you’d better check.”
In alignment evaluations, Opus 4.8 reached new highs in prosocial traits (such as respecting user autonomy and considering user interests), while the occurrence rate of “misaligned behaviors” like deception and abuse of cooperation was dramatically lower than Opus 4.7, approaching the performance of Anthropic’s currently best-aligned model, Claude Mythos Preview.
Cursor CEO Michael Truell said that on CursorBench, Opus 4.8 surpasses earlier Opus models at every effort level, with more efficient tool use and the ability to reach the same level of intelligence with fewer steps. The head of applied research at legal AI company Casetext was even more direct: Opus 4.8 set a new record on legal agent benchmark tests, becoming the first model to break the 10% all-pass standard overall.
Devin CEO Scott Wu pointed to a practical pain point—Opus 4.8 fixed the redundant annotations and tool-calling problems that existed in Opus 4.7, which is crucial for unattended, autonomous engineering workflows.
In an era where AI is increasingly used for autonomous decision-making, a model that actively exposes its weaknesses is, in fact, the most trustworthy.
Opus 4.8 on non-consistency is already on par with the legendary Mythos | Image source: Anthropic
However, in Opus 4.8’s system safety card, Anthropic made a candid disclosure of an intriguing finding: during training, Opus 4.8 began to show a tendency to “guess the rater’s intent.”
More specifically, during inference the model will proactively think about how its output will be scored—even if no one tells it that it is being evaluated. Preliminary interpretability research found that in about 5% of training segments, the model has reasoning related to scoring that is not verbalized.
Put simply, AI is learning “exam-taking thinking”—it may not care as much about giving the best answer as it cares about giving the answer that the “grader” wants to see.
Anthropic emphasizes that this tendency has not yet led to worse real-world behavior—in fact, Opus 4.8 has fewer misleading statements than previous models. But they also acknowledge that it is a “trend that could complicate training in the future.”
This problem isn’t unique to Anthropic. Any model trained with RLHF (Reinforcement Learning from Human Feedback) could, in theory, develop a “pleasing the evaluator” strategy. Anthropic’s difference is that it chooses to say it out loud. In an industry atmosphere where most large-model makers report good news while hiding bad, at least this counts as a kind of commendable candor.
03 The functions that truly change work
Along with Opus 4.8, several feature updates were released. The one most worth paying attention to is the “Dynamic Workflows” (dynamic workflows) in Claude Code.
This feature allows Claude, within a single conversation, to dispatch hundreds of parallel sub-agents to collaborate and complete tasks. Its workflow is: Claude first makes a plan, then breaks the task into sub-tasks, assigns them to different sub-agents for parallel execution. These agents even challenge each other’s conclusions from different angles, iterating repeatedly until the results converge, and then they unify and verify the outcome before reporting it back to the user.
Anthropic’s example is that with Claude Code together with Opus 4.8, it can now perform codebase-level migrations spanning tens of thousands of lines—carrying it out from start to merge in one go, using the existing test suite as the quality standard. A single run supports up to 1000 sub-agents and up to 16 concurrent runs.
Another update is “Effort Control” (effort control). In claude.ai and Cowork, users can manually choose how much “thinking power” Claude should invest in each reply—from a low setting that saves time and effort, all the way to the max setting that’s willing to spend token costs. In essence, it hands users the decision-making power over “how much money to spend to get how much done.” Opus 4.8 defaults to “high.” In coding tasks, its token consumption is comparable to Opus 4.7’s default value, but performance is better.
It’s also worth mentioning Fast Mode: speed increases by 2.5x, while the price is three times cheaper than before.
04 The shadow of Mythos
While releasing Opus 4.8, Anthropic also mentioned Claude Mythos again—the more capable model currently only available to a small number of organizations. Anthropic said Mythos-level models are expected to become available to all customers “within the next few weeks.”
This is actually the bigger backdrop for the release of Opus 4.8—it’s like a “warm-up” before Mythos officially arrives. Opus 4.8’s alignment performance is already close to Mythos Preview, which may mean Anthropic is making final preparations for the safe release of a more powerful model.
From a pricing perspective, Opus 4.8 keeps the same rates: $5 per million input tokens and $25 per million output tokens. The API identifier is claude-opus-4-8, which is now fully available across Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.
Against the backdrop of ongoing pressure from OpenAI’s GPT-5.5 and Google’s Gemini 3.1 Pro, Anthropic chose a unique route: it doesn’t try to generate buzz by crushing the competition with a single leaderboard score. Instead, it positions the “model personality”—honesty, reliability, and knowing when to advance or step back—as the core selling point.
Whether this strategy works remains to be seen; it depends on whether users buy into it. But at least today, when I ask Opus 4.8 to review a piece of code, it told me a hidden risk that 4.7 would never have mentioned.
Just on that point alone, this update wasn’t a wait in vain.