According to Beating, Microsoft has recently open-sourced the Phi-Ground model family, which targets the problem of deciding where an AI agent should click on a computer screen. The 4-billion-parameter version, paired with a larger language model that handles instruction planning, outperformed OpenAI Operator and Claude Computer Use in click accuracy on the Showdown benchmark. It also ranked first among all models with fewer than 10 billion parameters in five evaluations, including ScreenSpot-Pro. The team trained on more than 40 million data samples and found that three training techniques common in academic papers stop working at that scale. The key insight proved to be simple: output coordinates as plain numbers, such as "523, 417." Previous research invented specialized vocabularies for coordinates, but those methods do not scale. The team also found that placing the text instruction before the image can improve performance, because the model already knows which target to look for while it processes the pixels. Additionally, reinforcement learning methods such as DPO can still improve accuracy after fine-tuning.
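To make the plain-number output format concrete, here is a minimal Python sketch of how a reply such as "523, 417" could be parsed and scored. The helper names (`parse_click`, `is_hit`, `click_accuracy`) are hypothetical and not taken from the Phi-Ground release, and the scoring rule (a click counts as correct when it lands inside the target's bounding box) is an assumption, since the report does not say how click accuracy is computed.

```python
import re
from typing import Optional, Sequence, Tuple

# Hypothetical helpers for illustration; not part of the Phi-Ground release.
# Matches the first "x, y" pair of plain numbers in a model reply.
_COORD = re.compile(r"(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)")

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def parse_click(output: str) -> Optional[Point]:
    """Return the first (x, y) pair found in the text, or None if absent."""
    match = _COORD.search(output)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def is_hit(point: Point, box: Box) -> bool:
    """Assumed scoring rule: the click must fall inside the target box."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def click_accuracy(outputs: Sequence[str], boxes: Sequence[Box]) -> float:
    """Fraction of replies whose parsed click lands inside its target box."""
    if not boxes:
        return 0.0
    hits = 0
    for text, box in zip(outputs, boxes):
        point = parse_click(text)
        if point is not None and is_hit(point, box):
            hits += 1
    return hits / len(boxes)
```

Because the coordinates are ordinary text, a single regular expression is enough to read them back. No special coordinate tokens or custom decoding step are involved in this sketch.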
