NUS team releases GameWorld benchmark, evaluating multimodal AI agents across 34 browser games

ME News, April 17 (UTC+8): According to Beating Monitoring, a team at the National University of Singapore (NUS) has released GameWorld, a benchmark designed to standardize the evaluation of multimodal large language models (MLLMs) as general-purpose agents in video games. The study notes that although video games offer an ideal closed-loop testbed for interaction, existing evaluations are often limited by inconsistent control interfaces and manual, heuristic-based validation. GameWorld comprises 34 diverse browser games and 170 tasks, each paired with verifiable metrics derived from the game's underlying state so that results can be evaluated objectively. The team tested two types of agent interface: a "computer-use" agent that directly outputs keyboard and mouse commands, and a general multimodal agent that acts within a semantic action space via semantic parsing. In large-scale tests covering 18 model-interface combinations, even the best-performing agents remained far from human-level gaming ability. The study also highlights serious challenges for game agents in real-time interaction latency, sensitivity to context memory, and action validity. The paper and project code are publicly available on Hugging Face and GitHub. (Source: BlockBeats)
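For readers who want a concrete picture of the two interface styles and of state-based verification, here is a minimal sketch in Python. It is not GameWorld's code or API: the toy game, the `Task` class, the `SEMANTIC_ACTIONS` table, and every function name are hypothetical, chosen only to show how a success check can read the game state directly while the agent supplies either raw key events or named semantic actions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration only; not GameWorld's actual API.
GameState = Dict[str, int]
LowLevelAction = Tuple[str, str]  # e.g. ("key", "ArrowLeft")


@dataclass
class Task:
    name: str
    check: Callable[[GameState], bool]  # success predicate over the game state


# Assumed mapping from semantic actions to low-level inputs.
SEMANTIC_ACTIONS: Dict[str, List[LowLevelAction]] = {
    "move_left": [("key", "ArrowLeft")],
    "move_right": [("key", "ArrowRight")],
    "jump": [("key", "Space")],
}


def parse_semantic(action: str) -> List[LowLevelAction]:
    """Map a semantic action to key events; unknown actions yield no input."""
    return SEMANTIC_ACTIONS.get(action.strip().lower(), [])


def apply_action(state: GameState, action: LowLevelAction) -> GameState:
    """Toy dynamics: arrow keys move the player, Space adds one point."""
    kind, value = action
    new_state = dict(state)
    if kind == "key" and value == "ArrowLeft":
        new_state["x"] -= 1
    elif kind == "key" and value == "ArrowRight":
        new_state["x"] += 1
    elif kind == "key" and value == "Space":
        new_state["score"] += 1
    return new_state


def run_episode(task: Task, actions: list, semantic: bool) -> bool:
    """Replay an agent's actions, then verify the task against the final state."""
    state: GameState = {"x": 0, "score": 0}
    for action in actions:
        low_level = parse_semantic(action) if semantic else [action]
        for event in low_level:
            state = apply_action(state, event)
    return task.check(state)


if __name__ == "__main__":
    reach_x3 = Task("reach_x3", lambda s: s["x"] >= 3)
    # Semantic-interface agent: named actions are parsed into key events.
    print(run_episode(reach_x3, ["move_right"] * 3, semantic=True))
    # Computer-use agent: raw key events are applied directly.
    print(run_episode(reach_x3, [("key", "ArrowRight")] * 2, semantic=False))
```

In this sketch the verdict comes from `task.check(state)` rather than from a screenshot heuristic, and an unrecognized semantic action simply produces no input, a simplified stand-in for the action-validity issue the study mentions.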
Comment
FrontrunTherapy
· 6h ago
Real-time latency and context memory: these two pitfalls will probably be hard to overcome within the next six months.
GateUser-c4e25c95
· 6h ago
Issuing raw keyboard and mouse commands is too aggressive an approach; general-purpose multimodal interaction is the right solution.
StakingDaydreamer
· 6h ago
Low action effectiveness indicates that the planning layer is still weak, and the perception-decision link has not been established.
ExitLiquidityPoet
· 6h ago
Open-source code receives positive feedback, lowering the barrier to reproduction, and the community can iterate together.
RevokingPermissionsOnARainy
· 6h ago
The browser environment is harder than expected: the DOM changes rapidly, state is implicit, and agents easily get confused.