OpenAI Makes Its Models “Speak Up”, but the AI Doesn’t Come Cheap

Author: Su Yang, Tencent Technology

On May 8th, OpenAI added three new next-generation speech models to its API: GPT‑Realtime‑2, focused on speech reasoning and dialogue; GPT‑Realtime‑Translate, emphasizing real-time multilingual translation; and GPT‑Realtime‑Whisper, dedicated to speech-to-text transcription.

GPT‑Realtime‑2 is OpenAI’s first speech model with reasoning capabilities comparable to GPT‑5. It demonstrated significant improvements in benchmark tests: achieving an accuracy of 96.6% in the Big Bench Audio speech intelligence assessment, and an average pass rate of 48.5% in the Audio MultiChallenge instruction-following evaluation, representing increases of 15.2 and 13.8 percentage points over the previous generation GPT‑Realtime‑1.5.

Based on GPT‑Realtime‑2, speech AI has evolved from simple turn-based Q&A to a form capable of continuously listening, reasoning, calling tools, and completing tasks during a conversation.

A “Thinking” Speech Assistant

GPT‑Realtime‑2 aims to enable speech models to maintain conversational fluency while possessing reasoning and action capabilities necessary for handling complex tasks.

To improve naturalness in dialogue, the model introduces a leading prompt mechanism.

Developers can enable short prompts such as “Let me check” or “Hold on, I’m looking into it” to let users know their request has been received and is being processed before the formal response is generated.
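One way to picture this is steering the acknowledgement behaviour through session instructions. The sketch below is hypothetical: `session.update` follows the Realtime API’s event naming, but the article does not say whether GPT‑Realtime‑2 exposes a dedicated knob for leading prompts, so plain instruction text stands in for it.

```python
import json

def build_session_update(preamble_phrases):
    # Hypothetical payload: instruct the model to speak a short
    # acknowledgement before starting long-running work. The exact
    # configuration surface may differ in the real API.
    phrases = " or ".join(f'"{p}"' for p in preamble_phrases)
    return {
        "type": "session.update",
        "session": {
            "instructions": (
                "Before any tool call or long answer, first say a short "
                f"acknowledgement such as {phrases}, then continue."
            ),
        },
    }

payload = build_session_update(["Let me check", "Hold on, I'm looking into it"])
print(json.dumps(payload, indent=2))
```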

Along with this, the model supports parallel tool invocation and progress transparency: it can call multiple external tools simultaneously and narrate its progress aloud, for example “Checking your calendar” or “Looking it up”, so that the agent stays responsive rather than falling silent during task execution.
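On the client side, parallel tool invocation amounts to dispatching the model’s function calls concurrently instead of one at a time. A minimal sketch with toy stand-in tools (the tool names and call shape are illustrative, not OpenAI’s):

```python
import asyncio

# Toy stand-ins for real tools the model might call in one turn.
async def check_calendar(args):
    await asyncio.sleep(0.01)  # simulate a lookup
    return {"free_slots": ["14:00", "16:30"]}

async def web_lookup(args):
    await asyncio.sleep(0.01)
    return {"summary": "top search result"}

TOOLS = {"check_calendar": check_calendar, "web_lookup": web_lookup}

async def run_tool_calls(calls):
    # Run every function call from the model's turn concurrently,
    # rather than awaiting them sequentially.
    tasks = [TOOLS[c["name"]](c["arguments"]) for c in calls]
    return await asyncio.gather(*tasks)

results = asyncio.run(run_tool_calls([
    {"name": "check_calendar", "arguments": {}},
    {"name": "web_lookup", "arguments": {"query": "flight status"}},
]))
```

`asyncio.gather` preserves the order of the calls, so results can be mapped back to the originating tool-call IDs when replying to the model.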

When encountering difficulties, the model will proactively give hints like “I’m having a bit of trouble now” and attempt to recover, rather than silently failing or abruptly ending the conversation.

Additionally, the model’s context window has expanded from 32K to 128K, enabling it to maintain coherence over longer, more complex multi-turn conversations, supporting more complete agent workflows.
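Even with a 128K window, clients typically trim old turns to stay within budget. A rough sketch using a 4-characters-per-token heuristic (a real client would use the provider’s tokenizer; the budget mirrors the article’s figure):

```python
def trim_history(messages, max_tokens=128_000):
    # Keep the most recent turns whose rough token estimate fits the
    # window; 4 chars/token is a crude heuristic, not a real tokenizer.
    kept, used = [], 0
    for msg in reversed(messages):
        cost = max(1, len(msg["text"]) // 4)
        if used + cost > max_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))

history = [
    {"role": "user", "text": "x" * 400},
    {"role": "assistant", "text": "y" * 400},
    {"role": "user", "text": "z" * 40},
]
# With a tiny 120-token budget only the last two turns survive.
recent = trim_history(history, max_tokens=120)
```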

For professional applications, the model has a stronger grasp of domain-specific terminology, retaining technical terms, proper nouns, and medical jargon more accurately, which is especially valuable in production deployments. In expression, it offers more controllable tone and style and can switch registers according to context.

Another key upgrade is adjustable reasoning strength. Developers can choose from five levels: minimal, low, medium, high, and xhigh (default is low), balancing latency and reasoning depth.
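In client code, this choice could look like the sketch below. The five level names and the “low” default come from the article; the field name `reasoning_effort` is an assumption for illustration.

```python
REASONING_LEVELS = ("minimal", "low", "medium", "high", "xhigh")

def session_config(level="low"):
    # "low" is the default named in the article; "reasoning_effort"
    # is a hypothetical field name, not confirmed API surface.
    if level not in REASONING_LEVELS:
        raise ValueError(f"unknown reasoning level: {level!r}")
    return {"model": "gpt-realtime-2", "reasoning_effort": level}

fast = session_config("minimal")  # lowest latency, shallowest reasoning
deep = session_config("xhigh")    # highest latency, deepest reasoning
```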

No Small Talk

GPT‑Realtime‑2 outperforms its predecessor in benchmark tests

In the Big Bench Audio challenge, which tests challenging reasoning abilities in speech models, GPT‑Realtime‑2 (at high reasoning level) achieved an accuracy of 96.6%, compared to GPT‑Realtime‑1.5’s 81.4%, a 15.2 percentage point increase.

In the Audio MultiChallenge, which evaluates multi-turn interaction intelligence in spoken dialogue systems—covering instruction following, context integration, self-consistency, and natural speech correction—GPT‑Realtime‑2 (xhigh reasoning level) saw its average pass rate jump from 34.7% with GPT‑Realtime‑1.5 to 48.5%, a 13.8 percentage point improvement.

In fact, the most convincing scenario to determine if a speech model is truly “smart” isn’t casual chatting, but handling complex problems that require layered reasoning.

Note: OpenAI’s demo documentation includes a concrete test: a user describes their startup aloud, with the spoken reasoning and corresponding transcripts from both generations of Realtime models.

This case is a highly complex, reasoning-intensive task: the model must understand the relationships among several variables, such as customer flow that is unevenly distributed across the day, high fixed rent, and a business model built around slow-turnover coffee with low table turnover, and then perform logical deduction under all of those constraints.

GPT‑Realtime‑2 took 1 minute and 4 seconds to produce a well-structured, layered response. It not only analyzed the conflict between customer flow and the rent structure, pointing out that overly concentrated peak hours could leave overall space utilization too low to cover the rent, but also proposed concrete, lightweight testing paths.

The same question posed to the previous model GPT‑Realtime‑1.5 took 51 seconds, but with noticeably less depth. This direct comparison clearly demonstrates the generational gap in strategic reasoning capabilities.

Real-time Translation and Transcription

Besides GPT‑Realtime‑2, OpenAI also released two models tailored for specific scenarios.

GPT‑Realtime‑Translate focuses on real-time multilingual translation: it supports more than 70 input languages, outputs in 13 target languages in real time, and provides a transcript simultaneously. Its target applications include customer support, cross-border sales, education, live events, and creator platforms reaching a global audience.

Alberto Parravicini, AI lead at Vimeo, shared their use case: embedding GPT‑Realtime‑Translate during video playback, enabling creators to communicate across languages instantly with viewers worldwide.

Vimeo demonstrates GPT‑Realtime‑Translate’s real-time translation capability

GPT‑Realtime‑Whisper is a streaming speech-to-text model, designed for low-latency transcription scenarios.

It can begin generating text as soon as the speaker starts talking, suiting live meeting captions, classroom notes, broadcast subtitles, and voice-interaction scenarios where downstream workflows must be triggered immediately. Its core value lies in converting speech into structured text that downstream systems can consume while the conversation is still in progress.
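Consuming such a stream typically means folding incremental delta events into a growing transcript. The event names below mimic the Realtime API’s delta/completed pattern but are illustrative for GPT‑Realtime‑Whisper:

```python
def assemble_transcript(events):
    # Accumulate delta fragments until the completion marker arrives.
    # Event type names are assumptions modeled on the Realtime API's
    # general ".delta" / ".completed" convention.
    parts = []
    for event in events:
        if event["type"].endswith(".delta"):
            parts.append(event["delta"])
        elif event["type"].endswith(".completed"):
            break
    return "".join(parts)

stream = [
    {"type": "transcription.delta", "delta": "Live captions appear "},
    {"type": "transcription.delta", "delta": "while the speaker talks."},
    {"type": "transcription.completed"},
]
transcript = assemble_transcript(stream)
```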

Safety and Pricing

On the safety front, Realtime API incorporates multiple safeguards—built-in active classifiers monitor conversations in real-time, terminating sessions if harmful content guidelines are violated. Developers can also easily overlay custom safety filters using the Agents SDK.
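A custom overlay of the kind the article mentions could be as simple as a text check registered alongside the built-in classifiers. This is a toy stand-in for a real policy check, not the Agents SDK’s actual guardrail interface:

```python
def custom_safety_filter(text, banned_phrases=("wire the funds", "seed phrase")):
    # Toy phrase blocklist illustrating a developer-supplied filter
    # layered on top of OpenAI's built-in classifiers. A production
    # guardrail would use a real policy model or ruleset.
    lowered = text.lower()
    matched = [p for p in banned_phrases if p in lowered]
    return {"allowed": not matched, "matched": matched}

ok = custom_safety_filter("Your meeting is at 3pm.")
blocked = custom_safety_filter("Please wire the funds to this account.")
```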

OpenAI’s usage policies explicitly prohibit outputs used for spam, fraud, or other harmful purposes.

According to official guidelines, unless the interaction context clearly indicates the user is talking to an AI, developers must disclose to end users that they are interacting with artificial intelligence (for example, prompting “The AI is speaking now”). Additionally, the API fully supports EU data residency for EU customers and is covered by enterprise privacy commitments.

All three models are now available to developers via the Realtime API.

Pricing-wise, GPT‑Realtime‑2 charges per speech token: $32 per million input tokens, $0.40 per million for cached inputs, and $64 per million output tokens. GPT‑Realtime‑Translate is billed by usage time at $0.034 per minute, and GPT‑Realtime‑Whisper at $0.017 per minute.
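These per-token prices make cost estimation straightforward. A small calculator, assuming the cached-input rate of $0.40 is also quoted per million tokens:

```python
def realtime2_cost_usd(input_tokens, output_tokens, cached_input_tokens=0):
    # $32/M uncached input, $0.40/M cached input (assumed per-million),
    # $64/M output, per the article's quoted prices.
    uncached = input_tokens - cached_input_tokens
    return (uncached * 32
            + cached_input_tokens * 0.40
            + output_tokens * 64) / 1_000_000

# e.g. one million input tokens (none cached) plus half a million output tokens
cost = realtime2_cost_usd(1_000_000, 500_000)
```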

Promoting the new speech lineup, OpenAI CEO Sam Altman said on X: “People are really starting to interact with AI via speech, especially when they need to convey a lot of background information at once.”

He also mentioned that younger users seem to prefer communicating with AI through speech, while older users tend to type, raising an open question about whether this habit might change in the future.

The question is: now that OpenAI’s speech reasoning capabilities have advanced, who will be the next in line?
