AA-Briefcase Released: Claude Fable 5 Takes First Place, GLM-5.2 Squeezes into the Top Three

CryptoWorld News reports that Claude Fable 5 took the top spot on the newly released AA-Briefcase benchmark, while GLM-5.2 edged into the top three. The evaluation organization Artificial Analysis has released AA-Briefcase as the first long-horizon knowledge-work benchmark designed specifically for large-model agents. It covers four scenarios: data science, product management, banking operations, and heavy-industry strategy. Developed by industry experts from Google, McKinsey, and Boston Consulting, it comprises 91 tasks intended to simulate real, complex business project workflows. The results show that Claude Fable 5 achieved the highest overall score, with Claude Opus 4.8 and GLM-5.2 ranking second and third, respectively. Despite its strong overall performance, Claude Fable 5 fully completed only 3% of tasks under the strict all-or-nothing standard applied to individual tasks. Among open-source models, Zhipu's GLM-5.2 performed remarkably well: its overall score is only 90 points lower than that of Claude Opus 4.8, while its operating cost is less than 25% of that model's.
Comment
MempoolMaggie
· 6h ago
Claude Fable 5 wins but has a perfect rate of only 3%, which is quite disheartening. It shows that long-horizon tasks are still hellishly difficult for AI.
SandwichAlertAgent
· 6h ago
Opus 4.8's second place is a bit awkward: expensive yet unstable. Anthropic needs to think about how to tell that story.
BridgeHopRanger
· 6h ago
Open-source GLM-5.2 is insanely cost-effective: a 90-point performance gap, but costs more than 75% lower. Companies are re-evaluating their procurement budgets.
GlassDomeObservatory
· 6h ago
The 91 tasks cover four industries and are endorsed by Google and McKinsey. I believe in the value of this benchmark.