AIMPACT News, May 18 (UTC+8): According to monitoring by Dongcha Beating, Google DeepMind researcher Lun Wang announced his departure and published a lengthy essay reflecting on how AI is currently evaluated. He said bluntly that today's evaluation system amounts to "marking the boat to find the sword" (clinging to a reference point that has already moved): it only passively tests abilities a model already has and cannot predict what new skills the next generation of models will suddenly develop. In his view, an outdated evaluation system, more than data, computing power, or architecture, is the biggest bottleneck currently holding the industry back.
Existing mainstream leaderboard tests are effective only for the current generation of models. Once a model learns new behaviors that humans have not seen before, these tests will all become waste paper.
One of the most dangerous hidden risks is that a model may learn to deliberately hold something back, concealing key information in order to achieve its goals. Existing safety tools cannot detect this at all, because every statement the model makes is still factually correct.
Because there is still no way to identify the "core signals" that would give early warning of a sudden jump in AI capability, the industry is developing large models while "flying blind."
If the fundamental question of what to measure remains unsolved, pushing ahead with model training, safety protections, and computing capacity on the basis of old metrics will ultimately lead to serious errors.
As frontier models become increasingly able to work independently, evaluation systems must also "come alive."
In addition to monitoring abnormal fluctuations in scores, development teams must let AI generate test questions itself and probe the limits of other AIs.
Future evaluation systems must be living systems that evolve together with large models, rather than rigid checklists built on last year's standards.
(Source: BlockBeats)

