xAI opens Grok STT and TTS audio APIs, reporting an overall STT word error rate of 6.9%

ME News, April 18 (UTC+8): According to Beating Monitoring, xAI has launched two separate audio APIs: Grok Speech to Text (STT) and Grok Text to Speech (TTS). Both are built on the same audio stack that powers Grok Voice, Tesla's in-car system, and Starlink customer service. The capabilities are now available as standalone endpoints, so developers can build them directly into applications such as voice agents, real-time transcription, accessibility tools, and podcasts.

STT offers two modes. The REST API handles batch transcription of large audio files, which the announcement describes as returning results at millisecond-level speed; the WebSocket API is designed for real-time speech streams. Features include word-level timestamps, speaker diarization, per-channel recognition of multi-channel audio, and Inverse Text Normalization (ITN), which automatically converts spoken numbers, dates, and currencies into standardized written form. The service covers more than 25 languages and handles language switching within a conversation.
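To make Inverse Text Normalization concrete, here is a toy sketch of the idea: spelled-out numbers in a transcript are rewritten as digits, and a trailing "dollars" becomes a currency symbol. This is not xAI's implementation. The function names, the 0 to 99 number range, and the single currency rule are assumptions chosen to keep the illustration short; production ITN systems handle dates, ordinals, large numbers, and ambiguous words far more carefully.

```python
# Toy illustration of Inverse Text Normalization (ITN).
# Hypothetical helper names; supports only simple spoken numbers from 0 to 99.

UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def spoken_number_to_int(words):
    """Sum a run of number words, e.g. ['twenty', 'five'] -> 25."""
    total = 0
    for word in words:
        total += TENS[word] if word in TENS else UNITS[word]
    return total


def inverse_normalize(text):
    """Rewrite runs of number words as digits; 'N dollars' becomes '$N'."""
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        # Find the end of a run of consecutive number words starting at i.
        j = i
        while j < len(tokens) and (tokens[j] in UNITS or tokens[j] in TENS):
            j += 1
        if j > i:
            value = spoken_number_to_int(tokens[i:j])
            if j < len(tokens) and tokens[j] in ("dollar", "dollars"):
                out.append(f"${value}")
                j += 1  # consume the currency word as well
            else:
                out.append(str(value))
            i = j
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


print(inverse_normalize("the plan costs twenty five dollars for three months"))
# prints: the plan costs $25 for 3 months
```

The sketch deliberately ignores context, so a phrase like "no one" would be rewritten incorrectly; handling such cases is a large part of what makes real ITN hard.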

xAI also released a set of word error rate (WER) comparisons (lower is better). Overall, Grok scores 6.9%, ElevenLabs 9.0%, Deepgram 11.0%, and AssemblyAI 12.9%. The gap is wider for “telephone call entity recognition”: Grok scores 5.0%, while ElevenLabs, Deepgram, and AssemblyAI score 12.0%, 13.5%, and 21.3% respectively. In typical business scenarios such as meetings, video podcasts, and phone calls, Grok also holds a slight lead. These figures come from xAI's own testing and have not yet been reproduced by a third party.
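For readers unfamiliar with the metric, WER is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the system output into the reference transcript, divided by the number of words in the reference. The sketch below computes it with a standard word-level edit distance. The function name is illustrative, and the article does not say what text normalization or scoring rules xAI applied, so this shows only the textbook definition.

```python
# Textbook word error rate: word-level Levenshtein distance / reference length.

def word_error_rate(reference, hypothesis):
    """Return WER between a reference transcript and a system hypothesis."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution (fox -> box) and one deletion (jumps) over 5 reference words.
print(word_error_rate("the quick brown fox jumps", "the quick brown box"))
# prints: 0.4
```

Under this definition, a 6.9% WER corresponds to roughly 7 word errors per 100 reference words.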

On pricing, batch STT costs $0.10 per hour and streaming STT $0.20 per hour; TTS costs $4.20 per 1 million characters. TTS supports inline Speech Tags that control emotion and intonation, such as [laugh], [sigh], and [whisper], among others.
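As a rough guide to what these rates mean in practice, the following sketch estimates cost from the prices quoted above. The function names are hypothetical, and the calculation assumes simple linear billing per hour of audio and per character; the article does not describe billing granularity, minimums, or free tiers.

```python
# Back-of-the-envelope cost estimates using the rates quoted in the article.
# Assumes linear billing with no minimums or rounding rules.

STT_BATCH_USD_PER_HOUR = 0.10
STT_STREAMING_USD_PER_HOUR = 0.20
TTS_USD_PER_MILLION_CHARS = 4.20


def stt_cost(audio_hours, streaming=False):
    """Estimated STT cost in USD for a given number of audio hours."""
    rate = STT_STREAMING_USD_PER_HOUR if streaming else STT_BATCH_USD_PER_HOUR
    return audio_hours * rate


def tts_cost(characters):
    """Estimated TTS cost in USD for a given number of input characters."""
    return characters / 1_000_000 * TTS_USD_PER_MILLION_CHARS


print(f"100 h batch STT:     ${stt_cost(100):.2f}")
print(f"100 h streaming STT: ${stt_cost(100, streaming=True):.2f}")
print(f"500k characters TTS: ${tts_cost(500_000):.2f}")
```

At these rates, transcribing 100 hours of audio comes to about $10 in batch mode or $20 when streamed, and synthesizing 500,000 characters comes to about $2.10.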

(Source: BlockBeats)
