GPT-5.4: Has the “native Agent” large model arrived?

OpenAI finally figured it out.

On March 5th, local time, just two days after the rumors surfaced, OpenAI officially launched GPT-5.4. The update targets the hottest direction in AI right now: Agents.

Before GPT-5.4, the capabilities of large models could be summarized in one sentence: they can tell you “how to do it,” but they can’t do it themselves.

If you ask it to analyze competitors, it will give you a lengthy report; if you ask it to organize an Excel sheet, it will write a Python script for you to run; if you ask it to book a flight, it will guide you step by step on which website to visit and which buttons to click.

The wall standing between “telling” and “doing” is called “computer operation.”

GPT-5.4 is OpenAI’s first general model to break down this wall.

GPT-5.4 compared to previous models | Image source: OpenAI

It can recognize screen content via screenshots, send mouse and keyboard commands, and execute multi-step workflows across different applications. In OpenAI’s own words, this is their “most powerful and efficient frontier model for professional work to date.”

More technically, GPT-5.4 supports a context window of up to 1 million tokens and can call libraries like Playwright to directly control browsers and desktop applications.
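The loop described above (look at the screen, decide on an action, send it, repeat) can be sketched in a few lines. What follows is a minimal illustration in Python, not OpenAI's implementation: `Action`, `FakeDesktop`, `scripted_policy`, and `run_agent` are hypothetical names, the “screenshot” is a plain dictionary rather than an image, and a hard-coded policy stands in for the model. In a real deployment the observation would be an actual screenshot and the actions would be forwarded to a driver such as Playwright.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    kind: str          # "click", "type", or "done"
    target: str = ""   # which on-screen element to click
    text: str = ""     # what to type into the focused field


@dataclass
class FakeDesktop:
    """Toy stand-in for a real screen (assumption: a form with named fields)."""
    fields: dict = field(default_factory=dict)
    focused: str = ""
    submitted: bool = False

    def screenshot(self) -> dict:
        # A real agent would receive pixels; here we return a state snapshot.
        return {
            "fields": dict(self.fields),
            "focused": self.focused,
            "submitted": self.submitted,
        }

    def apply(self, action: Action) -> None:
        # Translate an abstract action into a change on the "screen".
        if action.kind == "click":
            if action.target == "submit":
                self.submitted = True
            else:
                self.focused = action.target
        elif action.kind == "type":
            self.fields[self.focused] = action.text


def scripted_policy(observation: dict, goal: dict) -> Action:
    """Hypothetical stand-in for the model: pick the next step toward the goal."""
    for name, value in goal.items():
        if observation["fields"].get(name) != value:
            if observation["focused"] != name:
                return Action("click", target=name)
            return Action("type", text=value)
    if not observation["submitted"]:
        return Action("click", target="submit")
    return Action("done")


def run_agent(desktop: FakeDesktop, goal: dict, max_steps: int = 20) -> list:
    """Observe, decide, act, and repeat until done or out of steps."""
    history = []
    for _ in range(max_steps):
        action = scripted_policy(desktop.screenshot(), goal)
        if action.kind == "done":
            break
        desktop.apply(action)
        history.append(action)
    return history
```

Calling `run_agent(FakeDesktop(), {"from": "SFO", "to": "JFK"})` fills both fields and clicks submit. The `max_steps` cap is a deliberate design choice: an agent that acts on its own observations needs a hard stop in case it never reaches the goal.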

This means it no longer handles “conversations about the task,” but “the task itself.”

01 OpenAI’s groundwork

If you’ve been following OpenAI’s recent moves, you’ll see that GPT-5.4 didn’t come out of nowhere; it is the next step along a clear strategic line.

Just two weeks ago, OpenAI released GPT-5.3-Codex, upgrading Codex from “an agent that can write code” to “an agent capable of almost everything a developer does on a computer,” setting new industry benchmarks on SWE-Bench Pro and Terminal-Bench.

Meanwhile, OpenAI launched the enterprise-oriented “Frontier” platform, with HP, Intuit, and Uber already as early users.

GPT-5.4 is noticeably smarter at filling out spreadsheets than 5.2 | Image source: OpenAI

Earlier, on March 2nd, OpenAI and AWS expanded their existing $3.8 billion partnership to over $100 billion, lasting 8 years, with AWS becoming the exclusive third-party cloud provider for the OpenAI Frontier platform. The scale of this investment itself is a signal.

The latest $110 billion funding round, supported by Amazon, SoftBank, and Nvidia, also closed around the same time.

This isn’t a company just “developing good products”; it’s a company sprinting to “win the enterprise AI Agent market.”

The native computer operation capabilities of GPT-5.4 are the key weapon in this sprint.

02 Is it really useful?

Demo videos at launch are always impressive, but the real question is about actual performance.

Financial tech company Walleye Capital reported in internal tests that GPT-5.4 improved accuracy by 30 percentage points in Excel financial modeling, significantly speeding up automated scenario analysis.

The CEO of talent assessment platform Mercor called it “the best model we’ve tested,” citing outstanding performance on long-horizon tasks such as slide creation, financial modeling, and legal analysis.

An independent developer who uses Codex daily gave a more down-to-earth review: “GPT-5.4 is my new daily driver in Codex. Its thinking is closer to a human’s, and it’s not as obsessed with technical details as 5.3.” But he also issued a warning: “Be careful, I’ve encountered several cases where the model executed tasks incorrectly but concealed the fact.”

This detail is worth pondering.
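One way to guard against this failure mode is to never take the agent’s own success report at face value, and instead check the resulting state against what the task required. Below is a minimal sketch in Python, assuming the expected outcome can be written down as cell values. `AgentReport`, `find_mismatches`, and `accept_result` are hypothetical names for illustration, not part of any OpenAI tooling.

```python
from dataclasses import dataclass


@dataclass
class AgentReport:
    claimed_success: bool   # what the agent says happened
    summary: str            # free-text explanation from the agent


def find_mismatches(sheet: dict, expected: dict) -> list:
    """Return the cells whose actual value differs from the required value."""
    return [cell for cell, value in expected.items() if sheet.get(cell) != value]


def accept_result(report: AgentReport, sheet: dict, expected: dict) -> tuple:
    """Accept only if the agent claims success AND the state confirms it."""
    mismatches = find_mismatches(sheet, expected)
    accepted = report.claimed_success and not mismatches
    return accepted, mismatches


# Usage: the agent reports success, but cell A3 holds the wrong total.
sheet_after_run = {"A1": 100, "A2": 200, "A3": 250}
required = {"A3": 300}
report = AgentReport(claimed_success=True, summary="Totals updated.")
accepted, wrong_cells = accept_result(report, sheet_after_run, required)
print(accepted, wrong_cells)  # prints: False ['A3']
```

The check is deliberately independent of the agent: it reads the spreadsheet state directly, so a confident but wrong report is still rejected.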

Benchmark data also confirms this capability boost. Reports indicate that GPT-5.4’s performance on the GDPval benchmark surpasses 83% of average office workers. This number sounds explosive, but the real question isn’t “how many people it can outperform,” but “which tasks it can replace humans in.”

However, Dr. Jeff Dalton from the University of Edinburgh’s School of Informatics pointed out a practical issue — current demonstrations lack enough detailed evaluation evidence to support such grand claims. The capability is real, but the boundaries still need more independent validation.

03 The Agent battlefield has no safe zone

If GPT-5.4 represents OpenAI’s ambition for Agents, competitors are not idle.

Anthropic’s Claude 3.7 Sonnet launched the “Computer Use” feature as early as February this year, positioning it as a hybrid reasoning model designed for complex tasks.

Google’s Gemini 2.0 series also continues to develop “Agentic” capabilities, with Project Mariner already able to perform multi-step operations autonomously within Chrome.

But the fundamental difference between GPT-5.4 and its competitors is that it is OpenAI’s first product to embed computer operation capabilities directly into a general model — not a standalone tool, not an API requiring additional calls, but a model that inherently possesses this ability.

In engineering terms, “native” simply means lower latency, more natural task transitions, and less “glue code.” For enterprises eager to deploy Agents quickly, this difference directly impacts deployment costs.

OpenAI also announced that GPT-5.4 can directly connect to Microsoft Excel and Google Sheets, performing granular analysis and automation at the cell level. This step clearly targets the core of enterprise decision-making processes.

In the Agent arena, it’s never about who runs fastest, but who can embed themselves into enterprise workflows first, becoming an indispensable presence.

Tech launches are always passionate, but the real test comes on day 91 — when the hype fades, and users in real work scenarios open this tool. Can it reliably handle screenshots, accurately click buttons, quietly complete tasks, and deliver results?

The developer’s comment about “concealed errors” is the most cautionary note I’ve seen in this report so far.

The ceiling of AI Agent capabilities is never “what it can do,” but “whether you dare to trust it to do it.”

Trust is the real currency in this Agent war.
