NUS team releases GameWorld benchmark, evaluating multimodal AI agents across 34 browser games

ME News Report, April 17 (UTC+8): According to Beating Monitoring, a team at the National University of Singapore (NUS) has released GameWorld, a benchmark designed to standardize the evaluation of multimodal large language models (MLLMs) as general-purpose agents in video games. The study notes that although video games offer an ideal closed-loop testbed for interaction, existing evaluations are often limited by inconsistent control interfaces and manual heuristic validation. GameWorld comprises 34 diverse browser games and 170 tasks, each paired with verifiable metrics derived from the game's underlying state so that results can be scored objectively. The team tested two types of agent interface: a "computer-use" agent that directly outputs keyboard and mouse commands, and a general multimodal agent that acts in a semantic action space, with its outputs translated into game inputs through semantic parsing. Across large-scale tests of 18 "model-interface" combinations, even the best-performing agents remained far from human-level play. The study also identifies significant challenges for game agents in real-time interaction latency, sensitivity to context memory, and action validity. The paper and project code are publicly available on Hugging Face and GitHub. (Source: BlockBeats)
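To make the two ideas in the report concrete, here is a minimal Python sketch of a semantic action space that expands to low-level key presses, and of a task whose success is checked against the game's underlying state rather than judged from a screenshot. It is an illustration only, not GameWorld's actual code: the action names, the `Task` class, the `score` field, and both functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical semantic action space for one game. Each named action expands
# to the key presses that a computer-use agent would have to emit directly.
SEMANTIC_ACTIONS: Dict[str, List[str]] = {
    "move_left": ["ArrowLeft"],
    "move_right": ["ArrowRight"],
    "jump": ["Space"],
}


def parse_semantic_action(name: str) -> List[str]:
    """Translate a semantic action into key presses.

    An unknown name returns an empty list, i.e. an invalid action.
    """
    return SEMANTIC_ACTIONS.get(name, [])


@dataclass
class Task:
    """A task with a verifiable success condition (hypothetical structure)."""
    description: str
    # The checker reads the game's underlying state, not the rendered frame.
    is_success: Callable[[Dict[str, int]], bool]


def evaluate(task: Task, final_state: Dict[str, int]) -> bool:
    """Score a finished episode from the final game state."""
    return task.is_success(final_state)


# Example task: the 'score' key is an assumed field of the game state.
task = Task(
    description="Reach a score of at least 100",
    is_success=lambda state: state.get("score", 0) >= 100,
)
```

Under these assumptions, both interface types can be scored by the same `evaluate` call, since the check depends only on the final game state and not on how the inputs were produced.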
FeeswitchWhisperer
· 1h ago
With this benchmark released, the game agent track is finally about to heat up. Waiting to see major companies follow suit.
SeaSaltSparklingWater
· 10h ago
Verifiable metrics are key; too many subjective evaluations have caused endless disputes without any definitive answer.
PickingUpAirdropsInTheFog
· 10h ago
Choosing the browser game scenario is clever, as it offers visual challenges and complex interactions, all without the hassle of environment setup.
VintageKeychain
· 10h ago
The paper and code are open on both Hugging Face and GitHub, thumbs up. Lowering the barrier to reproduction can only promote community engagement.
LiquidationRaincoat
· 11h ago
The comparison between the computer-use and general-purpose multimodal designs is quite interesting; I want to see in which specific scenarios the semantic action space ends up at a disadvantage.
TidalShellReflection
· 11h ago
Eighteen model-interface combinations, so the experiments are thorough. I like the NUS team's work style.
OwlAuthorizationMonitor
· 11h ago
The design of the action effectiveness metric is good; many benchmarks only care about the final score and ignore whether the process is elegant.
Paper-CutOctopusMarketAnalysis
· 11h ago
Even the best performers still fall far short of humans; it seems game agents have a long way to go, and this is not something that can be solved just by stacking parameters.
0xLateDinner
· 11h ago
Real-time latency and context memory sensitivity—these two pain points are so real, anyone who has played fast-paced games understands.
PixelatedDriedFish
· 11h ago
Finally, a team is seriously working on an agent benchmark for browser games, with 34 games and 170 tasks; the coverage is quite good.