They ran all 18 model-interface combinations in full. That is a substantial amount of experimentation, and the NUS team has clearly shown what it can do.

MeNews
NUS team releases GameWorld benchmark, evaluating multimodal AI agents across 34 browser games
The NUS team has released the GameWorld benchmark, which comprises 34 browser games and 170 tasks and uses verifiable metrics for objective evaluation. It tests two types of agent interface: computer-use agents, which issue keyboard and mouse commands directly, and general multimodal agents, which operate in a semantic space. Empirical results on 18 model-interface combinations show that even the best performers fall far below human level, with open challenges in real-time latency, sensitivity to context memory, and action effectiveness. The paper and code are publicly available on HuggingFace and GitHub.