xAI opens Grok STT and TTS audio APIs, reporting an overall STT word error rate of 6.9%


ME News update. On April 18 (UTC+8), according to Beating Monitoring, xAI launched two standalone audio APIs: Grok Speech to Text and Grok Text to Speech. Both come from the same audio stack that powers Grok Voice, Tesla’s in-car system, and Starlink customer service. They are now exposed as independent endpoints, so developers can build them directly into applications such as voice agents, real-time transcription, accessibility tools, and podcasts.

STT offers two modes. The REST API handles batch transcription of large audio files, with millisecond-level responses; the WebSocket API is designed for real-time audio streams. Capabilities include word-level timestamps, speaker diarization, multi-channel recognition, and Inverse Text Normalization (ITN), which automatically converts spoken numbers, dates, and currencies into standardized written form. The service covers 25+ languages and supports switching between languages within a single conversation.
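To make the idea of Inverse Text Normalization concrete, here is a toy sketch in Python. It is not xAI's implementation and does not call any xAI API; the function name `inverse_normalize` and the word tables are hypothetical, and it only handles whitespace-separated English cardinal numbers below one thousand.

```python
# Toy illustration of inverse text normalization (ITN).
# Assumption: this is NOT how Grok STT does it; it only shows the idea of
# turning spoken number words into digits after transcription.

SMALL = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19, "twenty": 20, "thirty": 30, "forty": 40,
    "fifty": 50, "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def inverse_normalize(text: str) -> str:
    """Replace runs of simple English number words with digits."""
    out = []
    current = None  # running value of the number phrase being read, if any
    for word in text.split():
        key = word.lower()
        if key in SMALL:
            current = (current or 0) + SMALL[key]
        elif key == "hundred" and current is not None:
            current *= 100
        else:
            # A non-number word ends the current number phrase.
            if current is not None:
                out.append(str(current))
                current = None
            out.append(word)
    if current is not None:
        out.append(str(current))
    return " ".join(out)


print(inverse_normalize("send twenty five dollars to room three hundred forty two"))
```

A production ITN system also has to handle dates, currencies, digit sequences such as phone numbers, and other languages, which this sketch deliberately ignores.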

xAI also released a set of word error rate (WER; lower is better) comparisons. Overall: Grok 6.9%, ElevenLabs 9.0%, Deepgram 11.0%, and AssemblyAI 12.9%. The gap is larger in “telephone call entity recognition”: Grok 5.0%, compared with 12.0%, 13.5%, and 21.3% for the other three. In common business scenarios such as meetings, video podcasts, and phone calls, Grok also holds a slight lead. These figures were measured and published by xAI itself; no third party has independently re-tested them yet.
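For readers unfamiliar with the metric, WER is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the system's output, divided by the number of words in the reference. The sketch below shows that standard calculation; the article does not describe xAI's exact scoring procedure (text normalization, test sets), so the function name and the whitespace tokenization here are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 3))
```

On this definition, a WER of 6.9% means roughly 7 word-level errors per 100 reference words.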

On pricing, batch STT costs $0.10 per hour of audio and streaming STT costs $0.20 per hour; TTS costs $4.20 per 1 million characters. TTS supports controlling emotion and intonation with inline Speech Tags such as [laugh], [sigh], and [whisper]. (Source: BlockBeats)
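A quick way to get a feel for these list prices is to turn them into a small estimator. The sketch below assumes strictly linear billing with no minimum charge, billing increments, or volume discounts, none of which the article specifies; the function names are hypothetical.

```python
from decimal import Decimal

# List prices quoted in the article (USD).
BATCH_STT_PER_HOUR = Decimal("0.10")
STREAMING_STT_PER_HOUR = Decimal("0.20")
TTS_PER_MILLION_CHARS = Decimal("4.20")

CENT = Decimal("0.01")


def stt_cost(hours: float, streaming: bool = False) -> Decimal:
    """Estimated STT cost for a number of audio hours (assumes linear billing)."""
    rate = STREAMING_STT_PER_HOUR if streaming else BATCH_STT_PER_HOUR
    return (Decimal(str(hours)) * rate).quantize(CENT)


def tts_cost(characters: int) -> Decimal:
    """Estimated TTS cost for a number of input characters (assumes linear billing)."""
    return (Decimal(characters) / Decimal(1_000_000) * TTS_PER_MILLION_CHARS).quantize(CENT)


print(stt_cost(100))                  # 100 hours of batch transcription
print(stt_cost(100, streaming=True))  # 100 hours of streaming transcription
print(tts_cost(500_000))              # 500,000 characters of synthesized speech
```

Under these assumptions, 100 hours of audio comes to $10.00 in batch mode or $20.00 in streaming mode, and 500,000 characters of TTS comes to $2.10.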

Comment
SudoSatoshi
· 3h ago
Multilingual coverage of 25+ languages, but how is the quality for low-resource languages?
The average WER looks good, but long-tail languages might still be a disaster.
AirdropUnderTheNeonBridge
· 3h ago
Inline tags for emotion and prosody? TTS is finally more than a script-reading machine; it can now add some flair to audiobooks or game NPC dialogue.
AirdropCartographer
· 3h ago
Multi-channel plus speaker separation makes this a natural fit for transcribing meeting recordings, but at a streaming cost of $0.20 per hour, long meetings still add up.
PerpPulse
· 3h ago
Grok Voice, Tesla’s in-car system, and Starlink customer service all run on the same audio stack; Musk’s closed-loop ecosystem keeps getting slicker.
MintLaterMaybe
· 3h ago
What is inverse text normalization? Converting spoken numbers into Arabic numerals? That matters a lot for post-processing speech transcripts, since it saves you from writing the regex yourself.
CliffsideAncientPineAndRolling
· 3h ago
xAI's recent one-two punch of audio APIs is quite aggressive: streaming STT at $0.20 per hour and TTS at $4.20 per million characters. The pricing is clearly aimed at large-scale commercial use.