xAI opens Grok STT and TTS audio APIs, with overall word error rate of STT reduced to 6.9%

ME News Report, April 18 (UTC+8): according to Beating Monitoring, xAI has launched two standalone audio APIs, Grok Speech to Text (STT) and Grok Text to Speech (TTS). Both come from the same audio stack that powers Grok Voice, Tesla's in-car system, and Starlink customer service. Now that they are offered as standalone endpoints, developers can integrate them directly into voice agents, real-time transcription, accessibility tools, and podcasts.

STT offers two modes. The REST API handles batch transcription of large audio files with millisecond-level response times, and the WebSocket API is designed for real-time audio streams. Features include word-level timestamps, speaker diarization, multi-channel recognition, and Inverse Text Normalization, which automatically converts spoken numbers, dates, and currency amounts into standardized written text. STT supports more than 25 languages and can switch between them seamlessly mid-conversation.

xAI also published a word error rate (WER; lower is better) comparison. Overall, Grok scored 6.9%, versus 9.0% for ElevenLabs, 11.0% for Deepgram, and 12.9% for AssemblyAI. The gap is even larger on telephone-call entity recognition, where Grok scored 5.0% against 12.0%, 13.5%, and 21.3% for the other three. Grok also holds a slight lead in common business scenarios such as meetings, video podcasts, and phone calls. These figures come from xAI's own testing and have not yet been independently verified.

On pricing, batch STT costs $0.10 per hour and streaming STT costs $0.20 per hour; TTS costs $4.20 per million characters. TTS supports inline Speech Tags such as [laugh], [sigh], and [whisper] to control emotion and prosody. (Source: BlockBeats)
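WER is conventionally computed as the word-level edit distance between a reference transcript and a system transcript (substitutions + deletions + insertions), divided by the number of words in the reference. The sketch below implements that standard definition. It assumes plain whitespace tokenization and lowercasing, and it does not reproduce xAI's benchmark setup or text normalization, which the announcement does not describe.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (S + D + I) / N using a word-level Levenshtein distance.

    Assumption: whitespace tokenization and lowercasing only; published
    benchmarks usually apply fuller text normalization before scoring.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the first i-1 reference words
    # and the first j hypothesis words (a rolling one-row DP table).
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[len(hyp)] / len(ref)


if __name__ == "__main__":
    ref = "please call me back at five thirty"
    hyp = "please call back at five fifty"
    # One deletion ("me") plus one substitution ("thirty" -> "fifty"): 2 / 7
    print(f"WER = {word_error_rate(ref, hyp):.3f}")  # WER = 0.286
```

Because insertions count as errors, WER can exceed 100% when a system outputs many extra words.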
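To make the published rates concrete, the following sketch turns them into a cost estimate. The three rates come from the announcement; the function name `estimate_cost` and the example workload (100 hours of batch audio, 20 hours of streaming, 2 million TTS characters) are hypothetical. It also assumes billing is strictly linear, with no rounding, minimums, or volume tiers, none of which the report addresses.

```python
from decimal import Decimal

# Rates as reported in the announcement (USD).
STT_BATCH_PER_HOUR = Decimal("0.10")
STT_STREAMING_PER_HOUR = Decimal("0.20")
TTS_PER_MILLION_CHARS = Decimal("4.20")


def estimate_cost(batch_hours: Decimal, streaming_hours: Decimal, tts_chars: int) -> Decimal:
    """Hypothetical linear estimate; ignores rounding, minimums, and tiers."""
    stt = batch_hours * STT_BATCH_PER_HOUR + streaming_hours * STT_STREAMING_PER_HOUR
    tts = Decimal(tts_chars) / Decimal(1_000_000) * TTS_PER_MILLION_CHARS
    return stt + tts


if __name__ == "__main__":
    total = estimate_cost(Decimal("100"), Decimal("20"), 2_000_000)
    # 100 h x $0.10 + 20 h x $0.20 + 2M chars x $4.20/M = $10 + $4 + $8.40
    print(f"Estimated total: ${total:.2f}")  # Estimated total: $22.40
```

`Decimal` is used instead of `float` so that currency arithmetic stays exact.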
