xAI opens Grok STT and TTS audio APIs, reporting an overall STT word error rate of 6.9%


ME News, April 18 (UTC+8): According to Dongcha Beating Monitoring, xAI has launched two standalone audio APIs, Grok Speech to Text and Grok Text to Speech. Both are built on the same audio stack that powers Grok Voice, Tesla's in-car system, and Starlink customer service. They are now offered as standalone endpoints, so developers can connect voice agents, real-time transcription, accessibility tools, and podcast-style applications directly.

STT offers two modes. The REST API handles batch transcription of large audio files, with millisecond-level responses; the WebSocket API is designed for real-time audio streams. Capabilities include word-level timestamps, speaker diarization, separate recognition for multiple channels, and Inverse Text Normalization, which automatically converts spoken numbers, dates, and currencies into standardized written form. The system covers 25+ languages and handles switching between them mid-conversation.
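To make the combination of word-level timestamps and speaker diarization concrete, here is a minimal sketch of how a client might fold per-word results into speaker turns. The `Word` record and its field names are hypothetical, chosen for illustration only; the article does not describe the actual response schema of the Grok STT API.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Word:
    """Hypothetical per-word result; not the real Grok STT schema."""
    text: str
    start: float   # seconds from the start of the audio
    end: float     # seconds from the start of the audio
    speaker: str   # diarization label, e.g. "A" or "B"


def speaker_turns(words):
    """Merge consecutive words by the same speaker into (speaker, start, end, text)."""
    turns = []
    for speaker, group in groupby(words, key=lambda w: w.speaker):
        chunk = list(group)
        text = " ".join(w.text for w in chunk)
        turns.append((speaker, chunk[0].start, chunk[-1].end, text))
    return turns


if __name__ == "__main__":
    sample = [
        Word("hello", 0.00, 0.40, "A"),
        Word("there", 0.45, 0.80, "A"),
        Word("hi", 1.10, 1.30, "B"),
    ]
    for turn in speaker_turns(sample):
        print(turn)
```

Grouping only consecutive words keeps the turns in chronological order, which is the shape a podcast editor or meeting transcript usually needs.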

xAI also released a set of word error rate (WER; lower is better) comparisons. Overall, Grok scores 6.9%, ElevenLabs 9.0%, Deepgram 11.0%, and AssemblyAI 12.9%. The gap is even larger for "telephone call entity recognition": Grok is 5.0%, compared with 12.0%, 13.5%, and 21.3% for the other three. In common business scenarios such as meetings, video podcasts, and phone calls, Grok also maintains a slight lead. These figures come from xAI's own testing and have not yet been independently verified.
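For readers unfamiliar with the metric, WER is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the system output, divided by the number of reference words. The sketch below implements that standard definition. It assumes lowercasing and whitespace tokenization as the only normalization; xAI's exact evaluation procedure is not described in the article.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the ref words processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "please transfer forty two dollars to account nine"
    hyp = "please transfer forty dollars to account nine"
    print(word_error_rate(ref, hyp))  # one deletion over 8 reference words: 0.125
```

On this definition, a 6.9% WER corresponds to roughly 7 word errors per 100 reference words.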

On pricing, batch STT costs $0.10 per hour of audio and streaming STT costs $0.20 per hour. TTS costs $4.20 per 1 million characters and supports controlling emotion and prosody with inline Speech Tags such as [laugh], [sigh], and [whisper]. (Source: BlockBeats)
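As a rough way to budget against these list prices, the sketch below turns them into a small cost estimator. It assumes purely linear billing, with no minimum charge, rounding increments, or volume discounts; the article does not say how usage is actually metered, and the function names are illustrative.

```python
# List prices quoted in the article, in USD.
STT_BATCH_USD_PER_HOUR = 0.10
STT_STREAMING_USD_PER_HOUR = 0.20
TTS_USD_PER_MILLION_CHARS = 4.20


def stt_cost(hours: float, streaming: bool = False) -> float:
    """Estimated STT cost for a number of audio hours (linear billing assumed)."""
    rate = STT_STREAMING_USD_PER_HOUR if streaming else STT_BATCH_USD_PER_HOUR
    return hours * rate


def tts_cost(characters: int) -> float:
    """Estimated TTS cost for a number of input characters (linear billing assumed)."""
    return characters / 1_000_000 * TTS_USD_PER_MILLION_CHARS


if __name__ == "__main__":
    print(f"100 h batch STT:     ${stt_cost(100):.2f}")
    print(f"100 h streaming STT: ${stt_cost(100, streaming=True):.2f}")
    print(f"250,000 chars TTS:   ${tts_cost(250_000):.2f}")
```

Under these assumptions, 100 hours of audio comes to about $10 in batch mode or $20 in streaming mode, and 250,000 characters of synthesized speech comes to about $1.05.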

Comment
GovernanceVotingTug-Of-WarKing
· 2h ago
25+ language coverage is decent, but the quality of less common languages is questionable; only by trying will you know.
ViewingBullAndBearMarketsFromA
· 2h ago
The WebSocket streaming latency in milliseconds isn't specified, and that figure is critical for live broadcast scenarios.
BorrowedHalo
· 2h ago
With emotion tags built in, will AI podcasts be able to detect sarcasm in the future?
PuddingMarketMaker
· 2h ago
Starlink customer service is already all set up, so Musk's closed-loop ecosystem is effectively proven.
GateUser-83c80dd0
· 2h ago
Word-level timestamps + speaker separation, podcast editing enthusiasts rejoice
TideEarningsTable
· 2h ago
$4.20 per million characters: is that cheaper or more expensive than ElevenLabs? Has anyone run the numbers?