Microsoft open-sources Phi-Ground: a 4-billion-parameter model beats Operator and Claude on click accuracy

According to Beating Monitoring, Microsoft has open-sourced the Phi-Ground model family, designed specifically to solve the "where on the screen" problem when AI controls a computer: given a screenshot and an instruction, the model outputs precise click coordinates. Paired with a large model for instruction planning, the open-source 4-billion-parameter version achieved higher click accuracy than OpenAI Operator and Claude Computer Use on the Showdown benchmark, and ranked first on all five evaluations, including ScreenSpot-Pro, among models below 4 billion parameters.

The team conducted large-scale validation on more than 40 million data points and found that three training techniques commonly used in prior academic work all failed once the dataset was scaled up. What actually works is simple: output coordinates directly as ordinary numbers, such as "523, 417." Several earlier papers invented a special vocabulary of position tokens for coordinates, hoping the model would speak coordinates the way it speaks words, but under large-scale training these new tokens could not be learned properly and instead caused the model to collapse. Another key is to place the text instruction before the image. Large models process their input in one direction: if the model reads "click the blue settings icon" first and only then sees the image, it already knows what to look for by the time it processes the pixels; if it sees the image first, it can only scan blindly, and performance is much worse.
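The two findings above can be sketched as follows. The message schema and field names are generic chat-API conventions used for illustration, not Phi-Ground's exact interface:

```python
import re

def build_messages(instruction: str, image_ref: str) -> list:
    # The text instruction comes BEFORE the screenshot, so the model
    # already knows what to look for when it processes the pixels.
    # (Generic chat-message schema; field names are illustrative.)
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": instruction},   # instruction first
            {"type": "image", "image": image_ref},   # screenshot second
        ],
    }]

def parse_click(output: str):
    # Coordinates are emitted as plain numbers ("523, 417"),
    # not as special position tokens added to the vocabulary.
    m = re.search(r"(\d+)\s*,\s*(\d+)", output)
    return (int(m.group(1)), int(m.group(2))) if m else None
```

Because the coordinates are ordinary digit tokens, no vocabulary extension or embedding re-initialization is needed; a plain regex recovers the click point from the model's text output.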

The team also found that reinforcement learning helps even on purely visual tasks. The approach: have the model make multiple click predictions on the same screenshot, then pair correct and incorrect clicks for preference-based training (a method called DPO, a form of reinforcement learning). Even after the model has been thoroughly fine-tuned, this step still improves accuracy significantly. Reinforcement learning had previously been applied mainly to language tasks that require reasoning, so it was an unexpected gain that it also works for a perceptual task like "look at the picture and decide where to click." To handle the problem of tiny buttons on 4K high-resolution screens (a button may occupy only 0.07% of the screen area), the team resized screenshots proportionally during training and pasted them onto a large white canvas, simulating real scenarios where elements are extremely small on high-resolution displays. This trick is especially effective in complex professional software such as Photoshop.
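The two data-side ideas above can be sketched roughly as follows; the pixel threshold, canvas size, and scale range are assumed illustrative values, not the paper's exact settings:

```python
import math
import random

def make_dpo_pairs(predictions, target, radius=5.0):
    # Repeated click predictions on one screenshot become
    # (chosen, rejected) preference pairs for DPO: clicks within
    # `radius` pixels of the ground-truth point count as correct.
    # The 5-pixel radius is an assumption for illustration.
    correct = [p for p in predictions if math.dist(p, target) <= radius]
    wrong = [p for p in predictions if math.dist(p, target) > radius]
    return [(c, w) for c in correct for w in wrong]

def shrink_onto_canvas(img_w, img_h, click_x, click_y,
                       canvas_w=3840, canvas_h=2160,
                       scale_range=(0.2, 0.6)):
    # Simulate tiny UI elements on a high-res display: scale the
    # screenshot down, place it at a random offset on a large white
    # canvas, and remap the ground-truth click into canvas coordinates.
    scale = random.uniform(*scale_range)
    new_w, new_h = int(img_w * scale), int(img_h * scale)
    off_x = random.randint(0, canvas_w - new_w)
    off_y = random.randint(0, canvas_h - new_h)
    new_click = (off_x + int(click_x * scale),
                 off_y + int(click_y * scale))
    return (off_x, off_y, new_w, new_h), new_click
```

The remapped click stays consistent with the pasted screenshot, so the same supervision signal trains the model on targets that occupy a far smaller fraction of the input image.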
