ME News, April 17 (UTC+8): According to Beating Monitoring, a team at the National University of Singapore (NUS) has released GameWorld, a benchmark designed to standardize the evaluation of multimodal large language models (MLLMs) as general-purpose agents in video games. The study notes that although video games offer an ideal closed-loop testbed for interaction, existing evaluations are often limited by inconsistent control interfaces and manual, heuristic-based validation. GameWorld comprises 34 diverse browser games and 170 tasks, each paired with verifiable metrics derived from the game's underlying state so that outcomes can be scored objectively. The team tested two types of agent interface: a "computer-use" agent that directly outputs keyboard and mouse commands, and a general multimodal agent that operates in a semantic action space via semantic parsing. Across 18 model-interface combinations tested at scale, even the best-performing agents remain far from human-level play. The study also highlights serious challenges for game agents in real-time interaction latency, sensitivity to context memory, and action validity. The paper and project code are publicly available on Hugging Face and GitHub. (Source: BlockBeats)
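
As a rough illustration of the two ideas described above, the sketch below shows what a state-based task metric and a semantic action space could look like. It is a minimal sketch under stated assumptions, not GameWorld's actual code: the names `Task`, `reach_score`, `SEMANTIC_TO_KEYS`, and `parse_semantic_action`, as well as the shape of the game state, are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical game state: a plain dict such as {"score": 120.0}.
GameState = Dict[str, float]


@dataclass
class Task:
    """A task scored only from game state, not from screenshots or heuristics."""
    name: str
    metric: Callable[[GameState], float]  # returns progress in [0, 1]


def reach_score(target: float) -> Callable[[GameState], float]:
    """Build a metric that reports progress toward a target score."""
    def metric(state: GameState) -> float:
        score = state.get("score", 0.0)
        # Clamp so that overshooting the target still counts as 1.0.
        return max(0.0, min(1.0, score / target))
    return metric


# Hypothetical semantic action space, mapped to low-level key events.
SEMANTIC_TO_KEYS: Dict[str, List[str]] = {
    "move_left": ["ArrowLeft"],
    "move_right": ["ArrowRight"],
    "jump": ["Space"],
}


def parse_semantic_action(action: str) -> List[str]:
    """Return the key events for a semantic action, or [] if it is invalid."""
    return SEMANTIC_TO_KEYS.get(action.strip().lower(), [])


task = Task("reach_100_points", reach_score(100))
print(task.metric({"score": 40}))      # progress toward the target
print(parse_semantic_action("Jump"))   # valid action, mapped to key events
print(parse_semantic_action("fly"))    # invalid action, yields no key events
```

In this sketch a "computer-use" agent would emit the key events directly, while a semantic agent would emit names such as `jump` that are parsed into key events; an action that fails to parse is one simple way to think about the "action validity" problem the study mentions.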

Comments

FrontrunTherapy

· 6h ago

Real-time latency and context memory: these two pitfalls will probably be hard to overcome within the next six months.

GateUser-c4e25c95

· 6h ago

Driving games through raw keyboard and mouse commands is too crude; general multimodal interaction is the right approach.

StakingDaydreamer

· 6h ago

Low action validity suggests that the planning layer is still weak and that the perception-to-decision link has not yet been established.

ExitLiquidityPoet

· 6h ago

Open-sourcing the code deserves praise: it lowers the barrier to reproduction and lets the community iterate together.

RevokingPermissionsOnARainy

· 6h ago

Browser environments are harder than expected: the DOM changes rapidly, state is implicit, and agents get confused easily.
