Sakana Fugu vs. Fable 5 benchmark comparison questioned: differences in test scaffolding may cause a 10-20 point deviation

ME AI News, according to Dongcha Beating monitoring: Japan-based AI startup Sakana AI claims that its multi-agent collaborative system Fugu Ultra has outperformed Anthropic's flagship model Fable 5 on multiple benchmarks, including scientific reasoning and programming, but the community has widely questioned the conclusions drawn from those scores. Critics point out that comparing self-reported results obtained in non-unified test environments is not objective. Benchmark scores depend heavily on the scaffold (also called the harness) that runs the model, and different scaffolds can shift scores by 10 to 20 points. On this view, the claimed "surpassing" is largely a product of system engineering optimization rather than a generational leap in underlying model capability.

Independent evaluation data shows that the agent scaffold built around a large model has a major impact on final scores. With the same Claude Opus 4.5 model, simply switching among three different open-source scaffolds moves the fix rate on the SWE-bench Pro benchmark between 50.2% and 55.4%. Further analysis by third-party testing agency Scale AI confirms that operational choices such as prompt templates, maximum attempt limits, context retention management, and tool call integration can cause a score deviation of 10 to 20 points for the same set of model weights.

The figures released by Sakana AI and Anthropic are each based on the company's own closed-source scaffold (vendor scaffold), tailored to its own system, and neither has been tested under uniform conditions in a standardized, independent third-party environment such as Scale SEAL. As a result, the published data cannot reliably reflect the relative strength of the two models' underlying capabilities. (Source: BlockBeats)