GPT-5.5 in one article: Starting today, OpenAI is "not selling" tokens anymore
Author: Helen
On April 23rd, local time, OpenAI officially released its new flagship model, GPT-5.5, which the company positions as “a completely new intelligence level for real-world work,” an important step toward a new way of working with computers.
This release has two core focuses:
First, a breakthrough in efficiency: the model is larger, yet latency is unchanged. GPT-5.5’s context window reaches 1 million tokens, but this is not a simple scale-up of GPT-5.4; efficiency improvements deliver higher intelligence at the same latency.
Second, during training, GPT-5.5 participated in optimizing its own inference infrastructure. In short, for the first time, an AI has helped tune its own serving stack.
In Terminal-Bench 2.0, a test of complex command-line workflows, GPT-5.5 scored 82.7%, surpassing Claude Opus 4.7’s 69.4% by over 13 percentage points; in OSWorld-Verified, which tests AI independently operating real computers, its success rate was 78.7%, exceeding the human baseline; and in GDPval, which covers 44 occupational task categories, 84.9% of its outputs reached or exceeded industry-expert level.
However, the price of GPT-5.5 has also increased significantly.
The API pricing is $5 per million input tokens and $30 per million output tokens, double that of GPT-5.4 ($2.50 per million input tokens and $15 per million output tokens). OpenAI emphasizes, however, that GPT-5.5 needs significantly fewer tokens to complete the same tasks, so overall costs may not rise substantially. The GPT-5.5 Pro API is priced at $30 per million input tokens and $180 per million output tokens. Batch and flex processing get a 50% discount, while priority processing costs 2.5 times the standard rate.
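To make the pricing concrete, here is a minimal sketch of the cost arithmetic using the per-million-token rates quoted above. The `request_cost` helper and the token counts are hypothetical illustrations, not part of any official SDK.

```python
# Illustrative cost comparison based on the listed API prices
# (USD per million tokens). The rates come from the article; the
# helper itself is hypothetical, not an official OpenAI API.
PRICES = {
    "gpt-5.4":     {"input": 2.50, "output": 15.0},
    "gpt-5.5":     {"input": 5.00, "output": 30.0},
    "gpt-5.5-pro": {"input": 30.0, "output": 180.0},
}

def request_cost(model: str, input_tokens: int, output_tokens: int,
                 tier: str = "standard") -> float:
    """Estimate the cost of one request in USD.

    tier: "standard", "batch"/"flex" (50% discount),
    or "priority" (2.5x the standard rate).
    """
    multiplier = {"standard": 1.0, "batch": 0.5,
                  "flex": 0.5, "priority": 2.5}[tier]
    p = PRICES[model]
    cost = (input_tokens / 1e6) * p["input"] \
         + (output_tokens / 1e6) * p["output"]
    return cost * multiplier

# A hypothetical job with 200k input and 50k output tokens:
old = request_cost("gpt-5.4", 200_000, 50_000)  # 0.50 + 0.75 = 1.25
new = request_cost("gpt-5.5", 200_000, 50_000)  # 1.00 + 1.50 = 2.50
# Per token the price doubles, but if GPT-5.5 needs roughly half the
# output tokens for the same task, the total stays closer to flat:
new_lean = request_cost("gpt-5.5", 200_000, 25_000)  # 1.00 + 0.75 = 1.75
```

This is why the headline “price doubled” and the official “overall costs may not rise much” can both be true: the per-token rate and the tokens consumed pull in opposite directions.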
In ChatGPT, GPT-5.5 is launched as “GPT-5.5 Thinking,” gradually replacing previous versions.
A small new feature: before the model begins thinking, it first gives a brief overview of its reasoning, and users can interject at any point during execution to adjust the direction.
To summarize GPT-5.5 in one sentence: past models are collections of capabilities; GPT-5.5 is closer to a working system that can plan, check, and continuously advance.
01 84.9% of Tasks Achieve Professional Level
Comparison of GPT-5.5 with competitors on core benchmarks like Terminal-Bench 2.0, GDPval, and OSWorld-Verified
First, let’s look at how the model performs in real professional scenarios. OpenAI used a benchmark called “GDPval,” which requires models to complete full sets of professional tasks. The test covers 44 occupational scenarios, including financial modeling, legal analysis, data science reports, operational planning, and more.
Results show that GPT-5.5 achieves or surpasses industry professional levels in 84.9% of tasks. In comparison, GPT-5.4 is at 83.0%, Claude Opus 4.7 at 80.3%, and Gemini 3.1 Pro only at 67.3%.
This gap is not only reflected in the overall scores. In spreadsheet modeling tasks, GPT-5.5’s internal test scored 88.5%; it also leads in investment banking-level modeling tasks. Early testers’ feedback is quite consistent: responses from GPT-5.5 Pro show significant improvements over GPT-5.4 Pro in comprehensiveness, structure, and practicality, especially in business, legal, education, and data science fields.
Numbers alone can blur together, so OpenAI pulled back the curtain on its own workplace.
OpenAI says over 85% of its internal staff use Codex weekly, across departments including finance, communications, marketing, product, and data science. The communications team used it to analyze six months of speaking-invitation data and build an automated classification pipeline; the finance team used it to review 1M-1 tax forms totaling 71,637 pages, finishing two weeks ahead of last year’s schedule; the marketing team relies on automated weekly reports, saving each person 5 to 10 hours per week.
This is no longer a lab demo; it has become part of daily work routines.
02 The Most Powerful Autonomous Programming Model
OpenAI claims that GPT-5.5 is currently their strongest autonomous programming model.
On Terminal-Bench 2.0 (testing complex command-line workflows requiring planning, iteration, and tool coordination), GPT-5.5 scored 82.7%, compared to GPT-5.4’s 75.1%, an improvement of nearly 8 percentage points, with less token consumption. On SWE-Bench Pro (evaluating real GitHub problem-solving ability in a one-shot manner), GPT-5.5 scored 58.6%. In internal Expert-SWE assessments (long-term programming tasks with a median human completion time of about 20 hours), GPT-5.5 also outperformed GPT-5.4.
Terminal-Bench 2.0 and Expert-SWE scatter plots
Driven by Codex, GPT-5.5 can now independently complete the entire development process—from code generation and functionality testing to visual debugging—starting from a single prompt.
OpenAI’s official demo shows that a space mission application built on NASA’s real orbital data supports 3D interactive control, with orbital mechanics simulation reaching real physical accuracy; a seismic tracker connected to real-time data sources and visualized, demonstrating the model’s full ability to call external APIs, handle dynamic data, and render in real time.
On the user side, Dan Shipper, founder and CEO of Every, shared an experience: after a launch he hit a bug, spent several days trying to fix it himself, and finally had to ask the company’s top engineer to rewrite part of the system. After GPT-5.5 was released, he ran an experiment, restoring the codebase to its pre-fix buggy state to see whether the model could arrive at the same solution as the engineer. GPT-5.4 failed; GPT-5.5 succeeded. His verdict: “This is the first programming model I’ve used that truly has conceptual clarity.”
A blunter evaluation came from an NVIDIA engineer: “Losing access to GPT-5.5 feels like losing a limb.”
Michael Truell, co-founder and CEO of Cursor, added: GPT-5.5 is smarter and more resilient than GPT-5.4, capable of sticking with long, complex tasks longer without stopping prematurely—which is exactly what engineering work needs.
03 Knowledge Work: AI’s First True “Use” of Computers
In the OSWorld-Verified test (evaluating whether models can independently operate real computer environments), GPT-5.5 achieved a success rate of 78.7%, higher than GPT-5.4’s 75.0% and Claude Opus 4.7’s 78.0%.
This is not just screenshot analysis but actual screen control: seeing interfaces, clicking, inputting, switching between tools until the task is completed. GPT-5.5 makes it possible for AI to truly work alongside you on the same computer.
Demo video of financial modeling
In the Tau2-bench test for telecom customer service workflows, GPT-5.5 achieved an accuracy of 98.0% without prompt tuning, compared to 92.8% for GPT-5.4.
This indicates the model’s understanding of task intent is sufficiently deep that it can handle complex multi-step dialogues without carefully crafted prompts.
In terms of tool search ability, GPT-5.5 scored 84.4% on BrowseComp, with GPT-5.5 Pro reaching 90.1%, demonstrating strong continuous retrieval and information integration capabilities in research tasks that require cross-referencing multiple sources.
04 Scientific Research: Assisting in Discovering New Mathematical Proofs
Among this release’s highlights, GPT-5.5’s performance in scientific research may be the most surprising.
In the past, AI in research was mostly seen as an “assistive tool”—for literature review, coding, data organization. But this time, its role has moved forward significantly, participating in core processes: complex reasoning and even discovery itself.
On GeneBench (a multi-stage data analysis benchmark for genetics and quantitative biology), GPT-5.5 scored 25.0%, compared to GPT-5.4’s 19.0%. These tasks typically require days of work by scientific experts, involving reasoning about potentially erroneous data, hidden confounders, and applying modern statistical methods with minimal supervision.
The graph shows that as output tokens increase, GPT-5.5’s score improvement always outpaces GPT-5.4’s, with a clear gap emerging around 15,000 tokens—indicating that for long, deep reasoning tasks, GPT-5.5’s advantage grows with task complexity.
On BixBench (a real-world bioinformatics and data analysis benchmark), GPT-5.5 scored 80.5%, leading GPT-5.4’s 74.0%, ranking among the top models.
A particularly noteworthy case is a custom tool framework integrated into GPT-5.5’s internal version, which helped discover a new proof related to Ramsey numbers, verified within the formal proof assistant Lean. Ramsey numbers are a core object of study in combinatorics, with rare and highly challenging results. This is not just AI providing code or explanations but genuinely contributing a mathematical proof.
In practical applications, Jackson Laboratory immunologist Derya Unutmaz used GPT-5.5 Pro to analyze a gene expression dataset with 62 samples and nearly 28,000 genes, generating a detailed research report and key findings—work that would normally take months for a team.
Bartosz Naskręcki, assistant professor at Adam Mickiewicz University in Poznań, built an algebraic geometry application in just 11 minutes using GPT-5.5 via Codex, visualizing the intersection of two quadrics and converting the resulting curve into a Weierstrass model. The real-time displayed equations can be directly used for further mathematical research, from prompt to executable research tool, all independently completed by the model.
Screenshot of the algebraic geometry app built by Professor Naskręcki—visualizing quadric intersections and real-time Weierstrass equation computation
Brandon White, co-founder of Axiom Bio, commented more directly: “If OpenAI keeps this momentum, the foundation for drug discovery will be fundamentally changed by the end of the year.”
05 Reasoning Efficiency: AI’s First Self-Optimizing Infrastructure
A subtle but potentially most significant technical advance in this release is often overlooked.
GPT-5.5 is a larger, more powerful model, yet its per-token latency in real service remains on par with GPT-5.4. To maintain the same latency with increased capability, OpenAI redesigned the entire inference system—and Codex and GPT-5.5 directly participated in this optimization.
The Artificial Analysis intelligence index chart clearly illustrates this: the horizontal axis is total output tokens (log scale), the vertical axis is overall intelligence score. GPT-5.5’s curve not only surpasses GPT-5.4, Claude Opus 4.7, and Gemini 3.1 Pro Preview in score but, more importantly, reaches the same performance level at fewer tokens—indicating higher efficiency and lower cost, a direct reflection of “efficiency gains.”
Artificial Analysis intelligence index line chart
Specifically, the challenge was load balancing: requests were previously split into fixed-size chunks to balance GPU workload, but no single static chunk size is optimal across all traffic patterns. Codex analyzed weeks of production traffic data and developed a custom heuristic, increasing token generation speed by over 20%.
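The idea can be sketched in a toy form: a fixed chunk size splits every request the same way, while a traffic-aware heuristic picks the chunk size from recently observed request lengths so work spreads evenly across GPUs. This is an illustration of the concept only; the function names and the clamping heuristic are invented here and are not OpenAI’s production algorithm.

```python
# Toy sketch of static vs. traffic-aware request chunking.
# Everything here is illustrative, not OpenAI's actual heuristic.
import statistics

def static_chunks(request_tokens: list[int], chunk: int) -> list[int]:
    """Split each request into fixed-size chunks; return all chunk sizes."""
    out = []
    for n in request_tokens:
        while n > 0:
            out.append(min(chunk, n))
            n -= chunk
    return out

def adaptive_chunk_size(recent_tokens: list[int], num_gpus: int) -> int:
    """Pick a chunk size from observed traffic.

    Hypothetical heuristic: aim for roughly one chunk per GPU for a
    median-sized request, clamped to a sane [64, 8192] range.
    """
    median = statistics.median(recent_tokens)
    target = int(median // num_gpus) or 64  # avoid a zero chunk size
    return max(64, min(8192, target))

# With mostly long requests, the adaptive size grows so each request
# spreads across the fleet; with short ones, it falls to the floor:
adaptive_chunk_size([8000] * 10, num_gpus=8)  # -> 1000
adaptive_chunk_size([10] * 5, num_gpus=8)     # -> 64
```

The point of the sketch is only the shape of the problem: one static constant cannot fit both traffic mixes, which is the gap the article says Codex’s learned heuristic closed.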
GPT-5.5 was co-designed with NVIDIA’s GB200 and GB300 NVL72 systems, spanning design, training, and deployment. In other words, this generation of models has helped optimize its own inference architecture; that is not just a metaphor but literally “AI improving its own system.”
06 Cybersecurity: Capabilities Improved, Controls Tightened
GPT-5.5 shows clear improvements in cybersecurity capabilities. In CyberGym testing, GPT-5.5 scored 81.8%, GPT-5.4 79.0%, and Claude Opus 4.7 73.1%. In internal Capture The Flag (CTF) challenge tasks, GPT-5.5 scored 88.1%, GPT-5.4 83.7%.
CyberGym bar chart and CTF challenge scatter plot
OpenAI rates GPT-5.5’s cybersecurity and bio/chemical capabilities as “High” under its Preparedness Framework, short of “Critical” but clearly improved over previous generations. The company also admits that the newly deployed, stricter risk classifiers “may initially cause some inconvenience for certain users,” and says it will keep adjusting them.
To balance defense needs and access restrictions, OpenAI launched the “Cybersecurity Trusted Access” program: qualified security researchers and critical infrastructure defenders can apply for more relaxed access to use advanced cybersecurity capabilities with less friction.
The underlying logic is that capabilities in cybersecurity, and even biology-related fields, tend to diffuse almost irreversibly. Rather than trying to restrict everyone’s access outright, the alternative is to prioritize those actually doing defense, giving them the most advanced tools first. In short, the question is not “whether to open up,” but “who gets to use it first.”