Microsoft open-sources Phi-Ground: a 4-billion-parameter model beats Operator and Claude on click accuracy

According to Beating Monitoring, Microsoft has open-sourced the Phi-Ground model family, designed specifically to solve the "where on the screen" problem when AI controls a computer: given a screenshot and an instruction, the model outputs precise click coordinates. Paired with a large model for instruction planning, the open-source 4-billion-parameter version achieved higher click accuracy than OpenAI Operator and Claude Computer Use on the Showdown benchmark, and ranked first on all five evaluations, including ScreenSpot-Pro, among models below 4 billion parameters.

The team conducted large-scale validation on more than 40 million data points and found that three training techniques commonly used in prior academic work all failed once the dataset was scaled up. What actually works is simple: output coordinates directly as ordinary numbers, such as "523, 417." Several earlier papers invented a special vocabulary of position tokens for coordinates, hoping the model would speak coordinates the way it speaks words, but under large-scale training these new tokens could not be learned properly and instead caused the model to collapse. Another key is to place the text instruction before the image. Large models process their input in one direction: if the model reads "click the blue settings icon" first and only then sees the image, it already knows what to look for by the time it processes the pixels; if it sees the image first, it can only scan blindly, and performance is much worse.
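The two findings above can be sketched as follows. The message schema and field names are generic chat-API conventions used for illustration, not Phi-Ground's exact interface:

```python
import re

def build_messages(instruction: str, image_ref: str) -> list:
    # The text instruction comes BEFORE the screenshot, so the model
    # already knows what to look for when it processes the pixels.
    # (Generic chat-message schema; field names are illustrative.)
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": instruction},   # instruction first
            {"type": "image", "image": image_ref},   # screenshot second
        ],
    }]

def parse_click(output: str):
    # Coordinates are emitted as plain numbers ("523, 417"),
    # not as special position tokens added to the vocabulary.
    m = re.search(r"(\d+)\s*,\s*(\d+)", output)
    return (int(m.group(1)), int(m.group(2))) if m else None
```

Because the coordinates are ordinary digit tokens, no vocabulary extension or embedding re-initialization is needed; a plain regex recovers the click point from the model's text output.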

The team also found that reinforcement learning helps even on purely visual tasks. The approach: have the model make multiple click predictions on the same screenshot, then pair correct and incorrect clicks for preference-based training (a method called DPO, a form of reinforcement learning). Even after the model has been thoroughly fine-tuned, this step still improves accuracy significantly. Reinforcement learning had previously been applied mainly to language tasks that require reasoning, so it was an unexpected gain that it also works for a perceptual task like "look at the picture and decide where to click." To handle the problem of tiny buttons on 4K high-resolution screens (a button may occupy only 0.07% of the screen area), the team resized screenshots proportionally during training and pasted them onto a large white canvas, simulating real scenarios where elements are extremely small on high-resolution displays. This trick is especially effective in complex professional software such as Photoshop.
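The two data-side ideas above can be sketched roughly as follows; the pixel threshold, canvas size, and scale range are assumed illustrative values, not the paper's exact settings:

```python
import math
import random

def make_dpo_pairs(predictions, target, radius=5.0):
    # Repeated click predictions on one screenshot become
    # (chosen, rejected) preference pairs for DPO: clicks within
    # `radius` pixels of the ground-truth point count as correct.
    # The 5-pixel radius is an assumption for illustration.
    correct = [p for p in predictions if math.dist(p, target) <= radius]
    wrong = [p for p in predictions if math.dist(p, target) > radius]
    return [(c, w) for c in correct for w in wrong]

def shrink_onto_canvas(img_w, img_h, click_x, click_y,
                       canvas_w=3840, canvas_h=2160,
                       scale_range=(0.2, 0.6)):
    # Simulate tiny UI elements on a high-res display: scale the
    # screenshot down, place it at a random offset on a large white
    # canvas, and remap the ground-truth click into canvas coordinates.
    scale = random.uniform(*scale_range)
    new_w, new_h = int(img_w * scale), int(img_h * scale)
    off_x = random.randint(0, canvas_w - new_w)
    off_y = random.randint(0, canvas_h - new_h)
    new_click = (off_x + int(click_x * scale),
                 off_y + int(click_y * scale))
    return (off_x, off_y, new_w, new_h), new_click
```

The remapped click stays consistent with the pasted screenshot, so the same supervision signal trains the model on targets that occupy a far smaller fraction of the input image.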
