ME News reports that xAI has launched two standalone audio APIs, Grok STT and Grok TTS. Both come from the same audio stack that powers Grok Voice, Tesla in-car systems, and Starlink customer service, among others. STT offers REST batch transcription and WebSocket real-time streaming, with word-level timestamps, speaker diarization, multi-channel support, and inverse text normalization, and covers more than 25 languages; TTS supports inline tags for emotion and prosody. xAI also published WER comparisons in which Grok leads in multiple scenarios; these have not yet been re-evaluated by a third party. Pricing: STT batch processing is $0.10 per hour and streaming is $0.20 per hour; TTS is $4.20 per million characters.

MeNews

2026-05-26 06:16:48

ME News Report, April 18 (UTC+8): according to Beating Monitoring, xAI has launched two standalone audio APIs, Grok Speech to Text and Grok Text to Speech. Both come from the same audio stack that supports Grok Voice, Tesla's in-car system, and Starlink customer service. They are now available as independent endpoints, so developers can connect them directly to voice agents, real-time transcription, accessibility tools, and podcast workflows.

STT offers two modes. The REST API handles batch transcription of large audio files and is described as having millisecond-level response; the WebSocket API is designed for real-time audio streams. Features include word-level timestamps, speaker diarization, multi-channel recognition, and Inverse Text Normalization, which automatically converts spoken numbers, dates, and currencies into standardized written form. More than 25 languages are supported, with seamless switching between languages within a conversation.

xAI also released a set of Word Error Rate (WER) comparisons. Overall: Grok 6.9%, ElevenLabs 9.0%, Deepgram 11.0%, AssemblyAI 12.9%. The gap in "Phone Call Entity Recognition" is larger, with Grok at 5.0% compared with 12.0%, 13.5%, and 21.3% for the other three. In common business scenarios such as meetings, video podcasts, and phone calls, Grok also holds a slight lead. These figures were measured and published by xAI itself and have not yet been verified by a third party.

On pricing, batch STT is $0.10 per hour and streaming STT is $0.20 per hour; TTS costs $4.20 per million characters. TTS supports inline Speech Tags to control emotion and intonation, such as [laugh], [sigh], and [whisper]. (Source: BlockBeats)
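For readers unfamiliar with the metric, the sketch below shows how Word Error Rate is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the system's output, divided by the number of reference words. This is a minimal illustration, not xAI's evaluation code; the function name `word_error_rate` and the simple lowercase-and-split tokenization are assumptions made for the example, and published benchmarks typically apply more careful text normalization first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = word-level edit distance / number of reference words.

    Tokenization here is a plain lowercase + whitespace split, which is an
    assumption for illustration only.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn the first i reference words into nothing
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "please transfer fifty dollars to my savings account"
    hypothesis = "please transfer fifteen dollars to my saving account"
    print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

Under this definition, a WER of 6.9% corresponds to roughly 7 word errors per 100 reference words. Because insertions count as errors, WER can exceed 100% when the output is much longer than the reference.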

SushiAndSlugs

· 05-26 06:41

Digging into the inline emotion tags: can they make the AI speak in an offbeat, backhanded, vaguely sarcastic "monotone"?

FragilePosition

· 05-26 06:41

Word-level timestamps + speaker diarization, a joy for podcast editing

MempoolSparrow

· 05-26 06:35

WebSocket real-time stream $0.2/hour, is it cheaper or more expensive than Whisper?

GateUser-b6d80ba0

· 05-26 06:30

Starlink customer service is already using it; no wonder it felt like talking to an AI the last time I called customer support.

AirdropMileCounter

· 05-26 06:28

25+ language coverage, how is the Chinese performance? Has anyone tested it?

ReflectiveChainShadow

· 05-26 06:28

The same audio stack ties together the in-car system, satellites, and chat; xAI's closed-loop ecosystem is clearly onto something.

MintAfterCoffee

· 05-26 06:28

What is inverse text normalization? Is it some cutting-edge technology? Can someone knowledgeable explain it in detail?

xAI opens Grok STT and TTS audio APIs, with overall word error rate of STT reduced to 6.9%