xAI opens Grok STT and TTS audio APIs, reporting an overall STT word error rate of 6.9%

ME News Report, April 18 (UTC+8), according to Beating Monitoring, xAI has launched two standalone audio APIs: Grok Speech to Text (STT) and Grok Text to Speech (TTS). Both come from the same audio stack that powers Grok Voice, Tesla's in-car system, and Starlink customer service. They are now available as independent endpoints that developers can integrate directly into voice agents, real-time transcription, accessibility tools, and podcasts.

STT offers two modes. The REST API handles batch transcription of large audio files with millisecond-level response; the WebSocket API is designed for real-time audio streams. Features include word-level timestamps, speaker diarization, multi-channel recognition, and Inverse Text Normalization, which automatically formats spoken numbers, dates, and currencies into standardized structured text. More than 25 languages are supported, with seamless switching mid-conversation.

xAI also released a Word Error Rate comparison (WER, lower is better). Overall, Grok scores 6.9%, ElevenLabs 9.0%, Deepgram 11.0%, and AssemblyAI 12.9%. The gap is larger in "Telephone Conversation Entity Recognition", where Grok scores 5.0% against 12.0%, 13.5%, and 21.3% for the other three. In common business scenarios such as meetings, video podcasts, and phone calls, Grok also holds a slight lead. These figures were measured and published by xAI itself and have not yet been verified by a third party.

On pricing, batch STT costs $0.10 per hour and streaming STT costs $0.20 per hour; TTS costs $4.20 per million characters. TTS supports inline Speech Tags such as [laugh], [sigh], and [whisper] to control emotion and prosody. (Source: BlockBeats)
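To put the quoted prices in concrete terms, the following is a minimal Python sketch of a cost estimate. The three rates are the ones reported above; the function names are hypothetical, and the linear proration (no minimum charge, no rounding to billing increments) is an assumption, since the article does not describe how usage is metered.

```python
# Rates as quoted in the article (USD). Everything else here is illustrative.
STT_BATCH_PER_HOUR = 0.10       # batch (REST) transcription
STT_STREAMING_PER_HOUR = 0.20   # streaming (WebSocket) transcription
TTS_PER_MILLION_CHARS = 4.20    # text to speech


def stt_cost(audio_seconds: float, streaming: bool = False) -> float:
    """Estimate STT cost for a given audio duration.

    Assumes simple linear proration of the hourly rate; real billing
    may apply minimums or rounding that the article does not mention.
    """
    rate = STT_STREAMING_PER_HOUR if streaming else STT_BATCH_PER_HOUR
    return audio_seconds / 3600 * rate


def tts_cost(num_chars: int) -> float:
    """Estimate TTS cost for a given number of input characters (same assumption)."""
    return num_chars / 1_000_000 * TTS_PER_MILLION_CHARS


if __name__ == "__main__":
    # 100 hours of archived recordings transcribed in batch mode
    print(f"batch STT, 100 h:     ${stt_cost(100 * 3600):.2f}")
    # the same 100 hours transcribed live over the streaming endpoint
    print(f"streaming STT, 100 h: ${stt_cost(100 * 3600, streaming=True):.2f}")
    # a 50,000-character script synthesized to speech
    print(f"TTS, 50k characters:  ${tts_cost(50_000):.2f}")
```

Under these assumptions, streaming costs exactly twice as much as batch for the same audio, so the choice between the two modes comes down to whether the transcript is needed in real time.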
Comment
BudgetValidator
· 1h ago
Voice infrastructure begins standardization, benefiting small and medium developers
MirrorBallGazingAtTheSky
· 3h ago
The same stack supports three scenarios; Elon Musk's reuse game is strong.
AirdropSidequest
· 4h ago
WebSocket is suitable for streaming, REST is suitable for archiving, the design is reasonable
CandlewickKid
· 8h ago
xAI has finally separated the voice stack, developers are ecstatic
RetroRadioSignal
· 8h ago
Grok’s STT offers both REST and WebSocket modes, covering batch and real-time needs; it’s quite thorough.
PatinaTradingBell
· 8h ago
The audio stack used by both Tesla and Starlink should have proven its reliability.
OracleBabysitter
· 8h ago
Accessibility tools +1, this is the warmth that technology should have