Computer-use vs. semantic agents: two technical approaches compete head to head, and the data speaks for itself.

MeNews
NUS team releases GameWorld benchmark, evaluating multimodal AI agents across 34 browser games
The NUS team has released the GameWorld benchmark, which comprises 34 browser games and 170 tasks with verifiable metrics for objective evaluation. It tests two types of agent interfaces: computer-use agents, which issue keyboard and mouse commands directly, and general multimodal agents, which act in a semantic action space. Empirical results on 18 model-interface combinations show that even the best performers fall far short of human level, with open challenges in real-time latency, context memory sensitivity, and action effectiveness. The paper and code are publicly available on HuggingFace and GitHub.
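To make the difference between the two interface types concrete, here is a minimal Python sketch. It is not taken from the GameWorld code: the class names, the action names such as `move_left`, and the key mapping are all hypothetical, chosen only to show how a semantic action might be lowered to the kind of keyboard event a computer-use agent would emit directly.

```python
from dataclasses import dataclass

# Hypothetical types for illustration only; not the GameWorld API.

@dataclass(frozen=True)
class KeyPress:
    """Low-level action, as a computer-use agent might issue it."""
    key: str

@dataclass(frozen=True)
class SemanticAction:
    """High-level action, as a semantic-space agent might issue it."""
    name: str

# Assumed mapping from semantic action names to keyboard keys.
SEMANTIC_TO_KEYS = {
    "move_left": ["ArrowLeft"],
    "move_right": ["ArrowRight"],
    "jump": ["Space"],
}

def lower(action: SemanticAction) -> list[KeyPress]:
    """Translate a semantic action into low-level key presses."""
    keys = SEMANTIC_TO_KEYS.get(action.name)
    if keys is None:
        raise ValueError(f"unknown semantic action: {action.name}")
    return [KeyPress(k) for k in keys]

# A computer-use agent emits the key press itself...
computer_use_output = [KeyPress("ArrowLeft")]
# ...while a semantic agent names the intent and a wrapper lowers it.
semantic_output = lower(SemanticAction("move_left"))
print(computer_use_output == semantic_output)
```

In this sketch both routes end in the same key press; the difference lies in which side, the model or the wrapper, is responsible for choosing the concrete input.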