Microsoft releases Fara-7B, its first 7B-parameter computer-use agent model

AIMPACT News, May 16 (UTC+8): Microsoft has released Fara-7B, its first 7-billion-parameter small language model designed specifically for computer use. The model uses a multimodal decoder architecture that takes screenshot images and text context as input and directly predicts its chain of thought and the next action together with that action's arguments. It is built on Qwen 2.5-VL (7B), supports a 128k context length, was trained for 2.5 days on 64 H100 GPUs, and was released under the MIT license on November 24, 2025. Fara-7B perceives the browser through screenshots and combines its internal reasoning with a record of prior states to predict the next action and its parameters (such as click coordinates). Training relies on a large-scale, fully synthetic dataset. The model can plan and carry out multi-step tasks such as booking restaurants, applying for jobs, and planning trips. For safety alignment, it is fine-tuned to recognize critical points, to refuse seven categories of policy-violating tasks, and to pause at critical points such as entering personal information or completing a purchase. Users can deploy and interact with the model through the GitHub repository, vLLM, and the fara-cli tool; it is aimed mainly at automating web tasks. (Source: InfoQ)
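The screenshot-in, action-out loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Microsoft's implementation: the JSON reply format, the action names (`click`, `stop`, `enter_personal_info`, `confirm_purchase`), and the `observe`, `predict`, `execute`, and `confirm` callbacks are all hypothetical placeholders and do not reflect Fara-7B's actual action schema or the fara-cli interface.

```python
import json
from dataclasses import dataclass, field

# Hypothetical action names standing in for "critical points";
# the real model's action vocabulary may differ.
CRITICAL_ACTIONS = {"enter_personal_info", "confirm_purchase"}


@dataclass
class Step:
    thought: str   # the model's reasoning for this step
    action: str    # e.g. "click"
    args: dict     # action parameters, e.g. {"x": 120, "y": 48}


@dataclass
class AgentState:
    task: str
    history: list = field(default_factory=list)  # (status, Step) pairs


def parse_step(raw: str) -> Step:
    """Parse one model reply, assumed here to be a JSON object."""
    data = json.loads(raw)
    return Step(data["thought"], data["action"], data.get("args", {}))


def run_episode(task, observe, predict, execute, confirm, max_steps=10):
    """Observe a screenshot, predict a step, then execute or pause.

    observe() returns screenshot bytes; predict(screenshot, state) returns
    the raw model reply; execute(step) performs the action in the browser;
    confirm(step) asks the user whether a critical action may proceed.
    """
    state = AgentState(task)
    for _ in range(max_steps):
        step = parse_step(predict(observe(), state))
        if step.action == "stop":
            break
        # Pause at critical points unless the user explicitly approves.
        if step.action in CRITICAL_ACTIONS and not confirm(step):
            state.history.append(("paused", step))
            break
        execute(step)
        state.history.append(("done", step))
    return state.history
```

In a real deployment, `predict` would call the served model (for example through a vLLM endpoint) and `execute` would drive a browser; both are left as callbacks here so the control flow can be read and tested in isolation.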
Comment
MintCondition
· 3h ago
Post-training safety alignment plus pausing at critical points: this design clearly reflects lessons learned.
DepegDaydream
· 3h ago
Training on fully synthetic data closes the data loop, so the cost of later iterations should keep falling.
BlueberryStakingMachine
· 4h ago
It handles screenshots and text at the same time; multimodality is finally a necessity rather than a gimmick.
LatencyMonk
· 4h ago
Trained on 64 H100s for 2.5 days: that cost is lower than I expected.
BridgeAnxiety
· 5h ago
Predicting coordinates and parameters directly is a big deal; with GPT-4V I still had to do the post-processing myself.
YieldBento
· 5h ago
fara-cli lets you interact straight from the command line, which will delight tech enthusiasts. I'll try it tomorrow.
BluePeonyDoesn'tDrop
· 5h ago
It can refuse policy-violating tasks and pause proactively; this safety alignment is more meticulous than that of some closed-source models.
PurpleMistLily
· 5h ago
With 128k context and screenshot awareness, browser automation finally no longer requires writing a bunch of XPath.
LonelyStoneUnderTheAurora
· 5h ago
The MIT license allows commercial use and modification; domestic companies that repackage open models are ready to go.
IdleFishDaoMember
· 5h ago
Qwen 2.5-VL base model plus fully synthetic data: synthetic data pipelines are becoming increasingly mainstream.