xAI opens Grok STT and TTS audio APIs, reporting an overall STT word error rate of 6.9%

ME News, April 18 (UTC+8): According to Beating Monitoring, xAI has launched two independent audio APIs, Grok Speech to Text (STT) and Grok Text to Speech (TTS). Both are built on the same audio stack that powers Grok Voice, Tesla's in-car system, and Starlink customer service. They are now exposed as standalone endpoints, so developers can build them directly into voice agents, real-time transcription, accessibility tools, and podcast applications.

STT offers two modes. The REST API handles batch transcription of large audio files, with millisecond-level response times according to the report; the WebSocket API is designed for real-time audio streams. Capabilities include word-level timestamps, speaker diarization, separate recognition of each channel in multi-channel audio, and inverse text normalization (ITN), which automatically converts spoken numbers, dates, and currency amounts into standardized written form. STT covers more than 25 languages and handles switching between languages within a conversation.
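To make the idea of inverse text normalization concrete, here is a minimal toy sketch in Python. It is not xAI's implementation and does not call any xAI API; the function names (`words_to_int`, `inverse_normalize`) are hypothetical, and it assumes a lowercase, unpunctuated English transcript. It handles only simple cardinal numbers and dollar amounts, not dates.

```python
# Toy inverse text normalization (ITN); NOT xAI's implementation.
# Assumption: the input is a lowercase, unpunctuated English transcript.

UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
NUMBER_WORDS = set(UNITS) | set(TENS) | {"hundred", "thousand"}
CURRENCY_WORDS = {"dollar", "dollars"}


def words_to_int(words):
    """Convert a run of number words (below one million) to an integer."""
    total, current = 0, 0
    for word in words:
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        elif word == "thousand":
            total += max(current, 1) * 1000
            current = 0
    return total + current


def inverse_normalize(transcript):
    """Rewrite spoken numbers as digits and 'N dollars' as '$N'."""
    tokens = transcript.split()
    out = []
    i = 0
    while i < len(tokens):
        # Find the longest run of consecutive number words starting at i.
        j = i
        while j < len(tokens) and tokens[j] in NUMBER_WORDS:
            j += 1
        if j == i:
            out.append(tokens[i])  # not a number word: copy through
            i += 1
            continue
        value = words_to_int(tokens[i:j])
        if j < len(tokens) and tokens[j] in CURRENCY_WORDS:
            out.append(f"${value}")  # fold the currency word into the amount
            j += 1
        else:
            out.append(str(value))
        i = j
    return " ".join(out)


print(inverse_normalize("the total is one hundred twenty five dollars due in thirty days"))
# prints: the total is $125 due in 30 days
```

A rule-based sketch like this breaks quickly on real speech (for example, "one" used as a pronoun, or digits read out one by one), which is why a built-in ITN stage in the recognizer is attractive.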

xAI also released a set of word error rate (WER) comparisons (lower is better). Overall: Grok 6.9%, ElevenLabs 9.0%, Deepgram 11.0%, AssemblyAI 12.9%. The gap is larger for "telephone call entity recognition": Grok 5.0%, versus 12.0%, 13.5%, and 21.3% for ElevenLabs, Deepgram, and AssemblyAI, respectively. In common business scenarios such as meetings, video podcasts, and phone calls, Grok is also slightly ahead across the board. These figures come from xAI's own testing and have not yet been independently verified by a third party.
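For readers unfamiliar with the metric, WER is conventionally computed as the word-level edit distance (substitutions + deletions + insertions) between a reference transcript and the system output, divided by the number of reference words. The following is a minimal sketch of that standard definition; the function name `word_error_rate` is illustrative, the text normalization is simply lowercasing and whitespace splitting, and it is not the scoring pipeline xAI used, which the report does not describe.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One of six reference words is dropped by the hypothetical recognizer.
print(f"{word_error_rate('the cat sat on the mat', 'the cat sat on mat'):.1%}")
# prints: 16.7%
```

Because the result depends heavily on the test audio and on how transcripts are normalized before scoring, WER figures from different vendors are only comparable when measured on the same data with the same normalization.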

On pricing: STT costs $0.10 per hour for batch processing and $0.20 per hour for streaming; TTS costs $4.20 per 1 million characters. TTS supports inline Speech Tags to control emotion and intonation, such as [laugh], [sigh], [whisper], and [.] (Source: BlockBeats)
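As a rough worked example of the prices quoted above, the sketch below estimates cost from usage. The rates are the ones stated in the article; the function names are hypothetical, and the linear billing model (no minimum charge, no rounding of billed units, no volume tiers) is an assumption that the report does not confirm.

```python
# Rates as quoted in the article (USD).
STT_BATCH_USD_PER_HOUR = 0.10
STT_STREAMING_USD_PER_HOUR = 0.20
TTS_USD_PER_MILLION_CHARS = 4.20


def stt_cost(audio_hours: float, streaming: bool = False) -> float:
    """Estimated STT cost, assuming purely linear per-hour billing."""
    rate = STT_STREAMING_USD_PER_HOUR if streaming else STT_BATCH_USD_PER_HOUR
    return audio_hours * rate


def tts_cost(num_chars: int) -> float:
    """Estimated TTS cost, assuming purely linear per-character billing."""
    return num_chars / 1_000_000 * TTS_USD_PER_MILLION_CHARS


print(f"100 h batch STT:     ${stt_cost(100):.2f}")
print(f"100 h streaming STT: ${stt_cost(100, streaming=True):.2f}")
print(f"250,000 chars TTS:   ${tts_cost(250_000):.2f}")
# prints:
# 100 h batch STT:     $10.00
# 100 h streaming STT: $20.00
# 250,000 chars TTS:   $1.05
```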

GateUser-7919e6b9
· 1h ago
Batch STT costs only $0.10/hour, which is cheaper than the Whisper API.
GateUser-28f37882
· 3h ago
The same stack feeds Grok Voice, in-car systems, and Starlink; this round of resource integration at xAI has some substance.
Don'tMessWithSlippage.
· 3h ago
Grok's audio stack is finally open to the public; Tesla owners are ecstatic.
ReflectiveChainShadow
· 3h ago
WebSocket real-time streaming costs $0.20 per hour. Can it run smoothly for live captioning?
MossyLedger
· 3h ago
The WER comparison has had no third-party retesting yet; let the bullets fly for a while first.
MistBlueLily
· 3h ago
Inverse text normalization is so useful for building voice assistants; I finally don't have to write the rules myself.
NodeUnderTheAurora
· 3h ago
TTS at $4.20 per million characters: is that cheaper or more expensive than ElevenLabs? Has anyone worked it out?