Long-term task benchmark test based on real browsing history released

According to an AIMPACT announcement on April 30 (UTC+8), Dan Fried announced on the X platform that his team has built a benchmark based on real user browsing history, comprising approximately 200 multi-site tasks designed to evaluate agents' success rate and efficiency on long-duration tasks, many of which take several hours to complete. The accompanying paper, led by Lawrence K. and others, has been published. The work focuses on assessing agent performance on complex, long-horizon web tasks. (Source: InfoQ)
