ME News, April 17 (UTC+8): According to Beating Monitoring, a team at the National University of Singapore (NUS) has released GameWorld, a benchmark designed to standardize the evaluation of multimodal large language models (MLLMs) as general-purpose agents in video games. The study notes that although video games offer an ideal closed-loop testbed for interaction, existing evaluations are often limited by inconsistent control interfaces and manual, heuristic-based verification. GameWorld comprises 34 diverse browser games and 170 tasks, each paired with verifiable metrics derived from the game's underlying state so that outcomes can be scored objectively. The team tested two types of agent interface: a "computer-use" agent that directly outputs keyboard and mouse commands, and a general multimodal agent that operates in a semantic action space via semantic parsing. Across 18 model-interface combinations, the results show that even the best-performing agents remain far below human-level play. The study also identifies significant challenges for game agents in real-time interaction latency, sensitivity to context memory, and action validity. The paper and project code are publicly available on Hugging Face and GitHub. (Source: BlockBeats)
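To make the two ideas in the report concrete, here is a minimal Python sketch of a state-based success check and a semantic action space that is parsed into key presses. It is an illustration under stated assumptions, not the GameWorld implementation: `Task`, `SEMANTIC_TO_KEYS`, `parse_semantic_action`, `action_validity`, and `evaluate` are hypothetical names, and the three-action space and score-based task are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# All names below are hypothetical and for illustration only;
# they are not taken from the GameWorld code.
GameState = Dict[str, float]


@dataclass
class Task:
    name: str
    check: Callable[[GameState], bool]  # success predicate on internal game state


# Assumed semantic action space and its mapping to raw key presses.
SEMANTIC_TO_KEYS: Dict[str, List[str]] = {
    "move_left": ["ArrowLeft"],
    "move_right": ["ArrowRight"],
    "jump": ["Space"],
}


def parse_semantic_action(action: str) -> List[str]:
    """Map a semantic action to key presses; reject actions outside the space."""
    if action not in SEMANTIC_TO_KEYS:
        raise ValueError(f"invalid action: {action!r}")
    return SEMANTIC_TO_KEYS[action]


def action_validity(actions: List[str]) -> float:
    """Fraction of emitted actions that fall inside the semantic action space."""
    if not actions:
        return 0.0
    valid = sum(1 for a in actions if a in SEMANTIC_TO_KEYS)
    return valid / len(actions)


def evaluate(task: Task, final_state: GameState) -> bool:
    """Score a task from the game's state rather than from a visual heuristic."""
    return task.check(final_state)


task = Task("reach_score_100", lambda s: s.get("score", 0) >= 100)
print(evaluate(task, {"score": 120}))                        # True
print(action_validity(["jump", "fly", "move_left", "dig"]))  # 0.5
```

A "computer-use" agent would skip `parse_semantic_action` and emit the key and mouse events directly, which is why the two interfaces can be compared on the same state-based checks.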

FeeswitchWhisperer

· 1h ago

With this benchmark released, the game agent track is finally about to heat up. Waiting to see major companies follow suit.

SeaSaltSparklingWater

· 10h ago

Verifiable metrics are key; too many subjective evaluations have caused endless disputes without any definitive answer.

PickingUpAirdropsInTheFog

· 10h ago

Choosing the browser game scenario is clever, as it offers visual challenges and complex interactions, all without the hassle of environment setup.

VintageKeychain

· 10h ago

The paper and code are open on both Hugging Face and GitHub, thumbs up. Lowering the barrier to reproduction can only help community engagement.

LiquidationRaincoat

· 11h ago

The comparison between the computer-use and general-purpose multimodal designs is quite interesting; I want to see in which specific scenarios the semantic action space ends up at a disadvantage.

TidalShellReflection

· 11h ago

Eighteen model-interface combinations, and the ablation experiments are thorough. I like the NUS team's working style.

OwlAuthorizationMonitor

· 11h ago

The design of the action validity metric is good; many benchmarks only care about the final score and ignore whether the process is elegant.

Paper-CutOctopusMarketAnalysis

· 11h ago

Even the best performers still fall far short of humans; it seems game agents have a long way to go, and this is not something that can be solved just by stacking parameters.

0xLateDinner

· 11h ago

Real-time latency and context memory sensitivity: these two pain points are very real, as anyone who has played fast-paced games knows.

PixelatedDriedFish

· 11h ago

Finally, a team is seriously working on an agent benchmark for browser games, with 34 games and 170 tasks; the coverage is quite good.
