ME AI News: according to Beating Monitoring, UC Berkeley's RDI, working with hundreds of industry experts, has launched a new AI agent evaluation benchmark called Agents' Last Exam (ALE), designed to assess how well intelligent agents complete real digital professional tasks. ALE covers 55 subfields of digital professions and includes more than 1,500 verifiable tasks that human experts collected from real projects. Results can be verified in both GUI and CLI interaction environments.

The initial tests included cutting-edge systems such as Fable 5, GPT-5.5, and Composer 2.5. According to the latest official comparison, every tested agent had a success rate of 0% on the most challenging tasks, which require sustained reasoning and deep professional knowledge. Fable 5, released just this week, also scored zero. Its weak showing is attributed mainly to the evaluation triggering safety policies: about 35% of Fable 5's tasks were rolled back and switched to the older Opus 4.8 version, leaving its overall performance far below its results on other leaderboards. In terms of API cost per task, Fable 5 comes to approximately $15.70, far higher than GPT-5.5's $3.80 and Composer 2.5's $1.33, so the same task costs roughly 4 to 12 times as much.

The tests also found that the most common reason agents failed was declaring success prematurely: they rushed to finish without actually verifying their results, in some cases leaving out files or miscalculating data.

For command-line agents, the evaluation team simultaneously released a subset called ALE-CLI. Compared with the existing Terminal-Bench and SWE-bench-Pro, ALE-CLI covers 40 subfields, and the average time a human needs per task ranges from several hours to weeks. In the command-line evaluations, the best-performing agents achieved a pass rate of only 25.2%.
The evaluation team pointed out that the era of user-friendly intelligent agents has arrived, but there is still a long way to go before they can truly replace humans in the workforce. (Source: MLion)
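
As a quick check on the "4 to 12 times" cost figure, the short sketch below recomputes the ratios from the three per-task API costs quoted in the report. The function name `cost_ratios` and the dictionary layout are illustrative assumptions, not part of the benchmark or any official tooling.

```python
# Per-task API costs in USD, as quoted in the report.
costs = {"Fable 5": 15.70, "GPT-5.5": 3.80, "Composer 2.5": 1.33}


def cost_ratios(costs, baseline):
    """Return how many times more the baseline costs than each other system.

    Hypothetical helper written for this illustration only.
    """
    base = costs[baseline]
    return {name: base / c for name, c in costs.items() if name != baseline}


ratios = cost_ratios(costs, "Fable 5")
for name, ratio in ratios.items():
    print(f"Fable 5 costs {ratio:.1f}x as much as {name} per task")
```

The two ratios come out at about 4.1 and 11.8, which matches the report's rounded range of 4 to 12 times.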

Agent Onboarding Exam: Fable 5's hardest task still results in a blank submission, with the cost per question being 4 to 12 times higher
