xAI opens Grok STT and TTS audio APIs, reporting an overall STT word error rate of 6.9%


ME News, April 18 (UTC+8): According to monitoring by Beating, xAI has launched two standalone audio APIs, Grok Speech to Text and Grok Text to Speech. Both are built on the same audio stack that powers Grok Voice, Tesla’s in-car system, and Starlink customer service. They are now exposed as independent endpoints that developers can integrate directly into applications such as voice agents, real-time transcription, accessibility tools, and podcasts.

STT is offered in two modes. The REST API handles batch transcription of large audio files, returning results at millisecond-level latency; the WebSocket API is designed for real-time audio streams. Built-in capabilities include word-level timestamps, speaker separation (diarization), multi-channel recognition, and Inverse Text Normalization, which automatically converts spoken numbers, dates, and currencies into standardized written form. More than 25 languages are supported, with seamless switching mid-conversation.
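To make Inverse Text Normalization concrete, here is a toy sketch in Python. It is not xAI's implementation: the vocabulary, the function names (`words_to_number`, `inverse_text_normalize`), and the single "dollars" currency rule are all assumptions chosen for illustration, and it handles only simple English cardinals.

```python
import re

# Toy vocabulary (assumption); a production ITN system also covers ordinals,
# dates, decimals, and many languages.
UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}
NUMBER_WORDS = set(UNITS) | set(TENS) | {"hundred", "thousand"}


def words_to_number(words):
    """Convert a run of spoken number words (e.g. ['one', 'hundred', 'five']) to an int."""
    total, current = 0, 0
    for word in words:
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        elif word == "thousand":
            total += max(current, 1) * 1000
            current = 0
    return total + current


def inverse_text_normalize(text: str) -> str:
    """Replace runs of number words with digits, then rewrite '<n> dollars' as '$<n>'."""
    out, run = [], []
    for token in text.split() + [None]:  # None is a sentinel that flushes the last run
        if token is not None and token.lower() in NUMBER_WORDS:
            run.append(token.lower())
            continue
        if run:
            out.append(str(words_to_number(run)))
            run = []
        if token is not None:
            out.append(token)
    return re.sub(r"\b(\d+) dollars\b", r"$\1", " ".join(out))


print(inverse_text_normalize("the total is one hundred twenty five dollars"))
# prints: the total is $125
```

A rule-based pass like this breaks quickly on ambiguous input (for example, "one of them" would become "1 of them"), which is one reason the feature is valuable when built into the recognizer itself.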

xAI also published a set of word error rate (WER, lower is better) comparisons. Overall: Grok 6.9%, ElevenLabs 9.0%, Deepgram 11.0%, AssemblyAI 12.9%. The gap widens for entity recognition on phone calls: Grok 5.0%, versus 12.0%, 13.5%, and 21.3% for the other three, respectively. In common business scenarios such as meetings, video podcasts, and phone calls, Grok also holds a slight lead. These figures come from xAI’s own testing and have not yet been independently verified.
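For readers unfamiliar with the metric, WER is conventionally the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. The Python sketch below computes it with a standard word-level edit distance. The function name and the lowercase-only normalization are assumptions; the article does not say how xAI normalized text or which datasets it scored.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from transcript
                curr[j - 1] + 1,     # insertion: extra word in transcript
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("turn left at the next light", "turn left at the next night")
print(f"{wer:.1%}")  # prints: 16.7%
```

Because the score depends heavily on the test audio and on text normalization, WER figures from different vendors are only comparable when measured on the same data with the same scoring rules.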

On pricing, batch STT costs $0.10 per hour and streaming STT costs $0.20 per hour; TTS costs $4.20 per 1 million characters. TTS supports inline Speech Tags such as [laugh], [sigh], and [whisper] to control emotion and prosody.
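As a rough illustration of what these list prices mean in practice, the Python sketch below estimates costs for a given workload. The rates are taken from the figures above; the function names are hypothetical, and the sketch assumes simple linear billing with no minimums, rounding rules, or volume discounts, none of which the article specifies.

```python
# Rates as reported in the article (USD).
BATCH_STT_USD_PER_HOUR = 0.10
STREAMING_STT_USD_PER_HOUR = 0.20
TTS_USD_PER_MILLION_CHARS = 4.20


def stt_cost(audio_hours: float, streaming: bool = False) -> float:
    """Estimated STT cost, assuming billing is linear in hours of audio."""
    rate = STREAMING_STT_USD_PER_HOUR if streaming else BATCH_STT_USD_PER_HOUR
    return audio_hours * rate


def tts_cost(characters: int) -> float:
    """Estimated TTS cost, assuming billing is linear in input characters."""
    return characters / 1_000_000 * TTS_USD_PER_MILLION_CHARS


print(f"{stt_cost(500):.2f}")                  # prints: 50.00
print(f"{stt_cost(500, streaming=True):.2f}")  # prints: 100.00
print(f"{tts_cost(250_000):.2f}")              # prints: 1.05
```

Under these assumptions, a 500-hour podcast archive would cost about $50 to transcribe in batch mode and twice that if streamed in real time.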

(Source: BlockBeats)
