Microsoft open-sources Phi-Ground: 4-billion-parameter model beats Operator and Claude on click accuracy

CryptoWorld News: Microsoft has open-sourced the Phi-Ground model family, built to solve the problem of where on the screen to click when an AI operates a computer. The open-source 4-billion-parameter version achieved over 90% click accuracy on the Showdown benchmark, surpassing OpenAI's Operator and Claude, and ranked first on all five evaluations, including ScreenSpot-Pro, among models under 10 billion parameters.

The team ran its experiments on a large-scale dataset of more than 40 million samples and found that three training techniques commonly used in academic papers all stop helping as data volume grows. What does work is treating coordinates as ordinary numbers written as plain text, such as "523, 417". The team also found that reinforcement learning helps even on purely visual tasks: the model makes multiple click predictions on the same image, and the correct and incorrect points are compared for training.

To handle buttons that appear tiny on 4K high-resolution screens, the team scaled screenshots down proportionally during training and pasted them onto a large white canvas, simulating the extremely small elements found on high-resolution displays. This trick is especially effective in complex professional software such as Photoshop.
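
To make the three ideas concrete, here is a minimal Python sketch of the geometry behind the white-canvas trick, the plain-text coordinate format, and the pairing of correct and incorrect clicks. It is an illustration only, not Microsoft's code: the function names, the 3840x2160 canvas, the 0.5 scale factor, the random paste position, and the rule that a click counts as correct when it lands inside the target's bounding box are all assumptions made for the example.

```python
import random


def shrink_onto_canvas(shot_size, target, canvas_size=(3840, 2160), scale=0.5, rng=None):
    """Map a click target from a screenshot onto a larger blank canvas.

    Hypothetical helper: the screenshot is scaled by `scale` (aspect ratio
    preserved) and placed at a random offset that keeps it inside the canvas.
    Only the geometry is computed; pixel work would need an image library.
    """
    rng = rng or random.Random()
    shot_w, shot_h = shot_size
    canvas_w, canvas_h = canvas_size
    new_w, new_h = round(shot_w * scale), round(shot_h * scale)
    if new_w > canvas_w or new_h > canvas_h:
        raise ValueError("scaled screenshot does not fit on the canvas")
    off_x = rng.randint(0, canvas_w - new_w)
    off_y = rng.randint(0, canvas_h - new_h)
    x, y = target
    # The click target moves with the screenshot: scale first, then shift.
    new_target = (off_x + round(x * scale), off_y + round(y * scale))
    return (off_x, off_y, new_w, new_h), new_target


def format_click(point):
    """Write a coordinate as ordinary text, in the style of "523, 417"."""
    return f"{point[0]}, {point[1]}"


def build_click_pairs(clicks, bbox):
    """Pair every click inside `bbox` with every click outside it.

    Assumption: a click is 'correct' if it falls inside the target element's
    bounding box (left, top, right, bottom).
    """
    left, top, right, bottom = bbox
    hits = [c for c in clicks if left <= c[0] <= right and top <= c[1] <= bottom]
    misses = [c for c in clicks if c not in hits]
    return [(format_click(h), format_click(m)) for h in hits for m in misses]


# Example: a 1920x1080 screenshot shrunk to half size on a 4K white canvas.
box, new_target = shrink_onto_canvas((1920, 1080), (1046, 834), rng=random.Random(0))
print("pasted region (x, y, w, h):", box)
print("training label:", format_click(new_target))

# Example: several sampled clicks on one image, split into correct/incorrect pairs.
print(build_click_pairs([(523, 417), (10, 10), (530, 420)], (500, 400, 560, 440)))
```

Because the paste offset is random, the same screenshot can yield many training samples in which the target occupies a much smaller share of the image, which is the situation the article describes for 4K displays.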
