xAI opens Grok STT and TTS audio APIs, reports overall STT word error rate of 6.9%


ME News, April 18 (UTC+8): According to Beating monitoring, xAI has launched two standalone audio APIs: Grok Speech to Text (STT) and Grok Text to Speech (TTS). Both are built on the same audio stack that powers Grok Voice, Tesla’s in-car system, and Starlink customer service. They are now exposed as standalone endpoints, so developers can integrate them directly into applications such as voice agents, real-time transcription, accessibility tools, and podcasts.

STT offers two modes. The REST API handles batch transcription of large audio files and returns results with millisecond-level latency; the WebSocket API is designed for real-time audio streams. Capabilities include word-level timestamps, speaker diarization, multi-channel recognition (each channel transcribed separately), and Inverse Text Normalization (ITN), which automatically formats spoken numbers, dates, and currencies into standardized written form. The system supports more than 25 languages and can switch between them seamlessly mid-conversation.
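
To make Inverse Text Normalization concrete, here is a toy sketch of the idea in Python. It is not xAI's implementation: the function names (`parse_number`, `normalize`) are hypothetical, it only handles English number words below 100 and a trailing "dollar(s)", and it ignores punctuation, dates, and ambiguous words such as the pronoun "one".

```python
# Toy inverse text normalization (ITN) for illustration only.
# Assumption: this is NOT how Grok STT works internally; it just shows
# the spoken-form -> written-form mapping that ITN performs.

# Number words 0..19.
SMALL = {w: n for n, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
# Tens words 20..90.
TENS = {w: 10 * n for n, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}


def parse_number(tokens, i):
    """Return (value, next_index) if a number below 100 starts at tokens[i], else None."""
    word = tokens[i].lower()
    if word in TENS:
        value = TENS[word]
        # A tens word may be followed by a ones word: "twenty five" -> 25.
        if i + 1 < len(tokens) and 1 <= SMALL.get(tokens[i + 1].lower(), 0) <= 9:
            return value + SMALL[tokens[i + 1].lower()], i + 2
        return value, i + 1
    if word in SMALL:
        return SMALL[word], i + 1
    return None


def normalize(text):
    """Rewrite spoken number words as digits, and '<number> dollars' as '$<number>'."""
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        parsed = parse_number(tokens, i)
        if parsed is None:
            out.append(tokens[i])
            i += 1
            continue
        value, i = parsed
        if i < len(tokens) and tokens[i].lower() in ("dollar", "dollars"):
            out.append(f"${value}")
            i += 1  # consume the currency word
        else:
            out.append(str(value))
    return " ".join(out)


print(normalize("the total is twenty five dollars"))
```

A production ITN system would also cover larger numbers, dates, times, and multiple languages; the sketch only shows the direction of the transformation.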

xAI also released a set of word error rate (WER) comparisons (lower is better). Overall across scenarios: Grok 6.9%, ElevenLabs 9.0%, Deepgram 11.0%, AssemblyAI 12.9%. The gap is wider for “phone call entity recognition”: Grok 5.0%, versus 12.0%, 13.5%, and 21.3% for the other three, respectively. In common business scenarios such as meetings, video podcasts, and phone calls, Grok also holds a slight lead.
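
For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The sketch below shows that standard definition in Python; the function names are illustrative, and it makes no claim about how xAI tokenized or normalized text in its own benchmark.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level Levenshtein distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer is lower than baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))

# Reported overall figures from the article: Grok 6.9% vs ElevenLabs 9.0%.
print(f"{relative_reduction(9.0, 6.9):.1%}")
```

On the reported overall numbers, 6.9% against 9.0% works out to roughly a 23% relative reduction in errors, which is a more useful way to compare than the 2.1-point absolute gap.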

These figures are self-reported by xAI and have not yet been independently re-tested. On pricing, STT costs $0.10 per hour for batch processing and $0.20 per hour for streaming; TTS costs $4.20 per 1,000,000 characters. TTS supports inline Speech Tags to control emotion and intonation, for example [laugh], [sigh], and [whisper]. (Source: BlockBeats)
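
As a rough budgeting aid, the quoted prices can be turned into a small calculator. This is a sketch based only on the rates stated in this article; the constant and function names are made up for illustration, and it assumes simple linear billing with no minimums, rounding rules, or volume discounts, none of which the article specifies.

```python
# Rates as quoted in the article (USD). Billing granularity is an assumption.
STT_BATCH_USD_PER_HOUR = 0.10
STT_STREAMING_USD_PER_HOUR = 0.20
TTS_USD_PER_MILLION_CHARS = 4.20


def stt_cost(audio_hours: float, streaming: bool = False) -> float:
    """Estimated STT cost for a given number of audio hours."""
    rate = STT_STREAMING_USD_PER_HOUR if streaming else STT_BATCH_USD_PER_HOUR
    return audio_hours * rate


def tts_cost(characters: int) -> float:
    """Estimated TTS cost for a given number of input characters."""
    return characters / 1_000_000 * TTS_USD_PER_MILLION_CHARS


# Example: 100 hours of live captions versus the same audio transcribed in batch,
# plus synthesizing 500,000 characters of text.
print(f"streaming STT: ${stt_cost(100, streaming=True):.2f}")
print(f"batch STT:     ${stt_cost(100):.2f}")
print(f"TTS:           ${tts_cost(500_000):.2f}")
```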

Comment
TwoFactorZen
· 7h ago
WebSocket real-time streaming at $0.20/hour: brothers making live subtitles can do the math.
Frost-ColoredCubeCity
· 10h ago
The batch pricing is okay, but pricing streaming at double the batch rate clearly pushes you toward batch. It's an old trick.
GateUser-517aed04
· 10h ago
The same stack feeds Tesla's in-car system and Starlink customer service. Musk is really good at playing this closed loop.
GateUser-b6d80ba0
· 10h ago
The WER numbers talk big but are entirely self-reported, published without waiting for third-party re-testing. Old crypto hands know what that means.