ME News reports that xAI has launched two standalone audio APIs, Grok STT and Grok TTS. Both are built on the same audio stack that powers Grok Voice, Tesla in-car systems, and Starlink customer service, among others. STT offers REST batch transcription and WebSocket real-time streaming, with word-level timestamps, speaker diarization, multi-channel support, and inverse text normalization, and covers more than 25 languages; TTS supports inline tags for emotion and prosody. xAI also published WER comparisons in which Grok leads in multiple scenarios, but no third party has re-evaluated them yet. Pricing: batch STT at $0.10 per hour, streaming STT at $0.20 per hour; TTS at $4.20 per million characters.

MeNews

2026-05-26 17:23:03


ME News update. On April 18 (UTC+8), according to Beating Monitoring, xAI launched two standalone audio APIs: Grok Speech to Text and Grok Text to Speech. Both come from the same audio stack that powers Grok Voice, Tesla's in-car system, and Starlink customer service. They are now exposed as independent endpoints, so developers can integrate them directly into applications such as voice agents, real-time transcription, accessibility tools, and podcasts.

STT offers two modes. The REST API handles batch transcription of large audio files, with millisecond-level responses; the WebSocket API is designed for real-time audio streams. Built-in capabilities include word-level timestamps, speaker diarization, multi-channel recognition, and Inverse Text Normalization (ITN), which automatically converts spoken numbers, dates, and currencies into standardized structured text. The service covers 25+ languages and supports seamless language switching within a conversation.
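To make the idea of Inverse Text Normalization concrete, here is a toy sketch in Python. It is not xAI's implementation and does not call the Grok API; the function names (`words_to_int`, `inverse_normalize`) and the tiny vocabulary are assumptions made purely for illustration. It only handles simple English cardinal numbers and a "dollars" suffix, and it does not deal with punctuation attached to number words.

```python
import re

# Toy vocabulary (illustrative only); a real ITN system also covers dates,
# ordinals, units, and many languages.
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
NUMBER_WORDS = set(UNITS) | set(TENS) | {"hundred", "thousand"}


def words_to_int(tokens):
    """Convert number words such as ['two', 'hundred', 'fifty'] to 250."""
    total, current = 0, 0
    for tok in tokens:
        if tok in UNITS:
            current += UNITS[tok]
        elif tok in TENS:
            current += TENS[tok]
        elif tok == "hundred":
            current = max(current, 1) * 100
        elif tok == "thousand":
            total += max(current, 1) * 1000
            current = 0
    return total + current


def inverse_normalize(text):
    """Replace runs of spoken number words with digits, then format dollars."""
    out, run = [], []

    def flush():
        # Emit the pending run of number words as a single digit string.
        if run:
            out.append(str(words_to_int(run)))
            run.clear()

    for tok in text.split():
        if tok.lower() in NUMBER_WORDS:
            run.append(tok.lower())
        else:
            flush()
            out.append(tok)
    flush()
    # Currency rule: "<n> dollars" becomes "$<n>".
    return re.sub(r"\b(\d+) dollars\b", r"$\1", " ".join(out))


print(inverse_normalize("the invoice is for two hundred fifty dollars"))
```

A hosted service does this inside the recognizer, which is why it can replace a hand-written post-processing step like the one above.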

xAI also released a set of word error rate (WER, lower is better) comparisons. Overall: Grok 6.9%, ElevenLabs 9.0%, Deepgram 11.0%, and AssemblyAI 12.9%. The gap is even larger in "telephone call entity recognition": Grok 5.0%, compared with 12.0%, 13.5%, and 21.3% for the other three. In common business scenarios such as meetings, video podcasts, and phone calls, Grok also maintains a slight lead. These figures were measured and published by xAI itself; no third party has re-tested them yet.
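For readers who want to check such numbers on their own audio, WER is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. The sketch below shows that standard definition; xAI has not published its scoring script, so the lowercasing and whitespace tokenization used here are assumptions, and different normalization choices will shift the result.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein, row by row).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word and one dropped word against a six-word reference.
wer = word_error_rate("call me at five pm today", "call me at nine pm")
print(f"{wer:.1%}")
```

Because the denominator is the reference length, a system that inserts many extra words can score above 100%.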

In terms of pricing, batch STT costs $0.10 per hour and streaming STT costs $0.20 per hour; TTS costs $4.20 per 1 million characters. TTS supports controlling emotion and intonation through inline Speech Tags such as [laugh], [sigh], and [whisper]. (Source: BlockBeats)
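The quoted rates make it easy to estimate a bill. The following is a rough sketch using only the prices reported above; the function names are hypothetical, and it assumes linear billing with no minimums, rounding, or volume tiers. Whether Speech Tags such as [laugh] count toward billable TTS characters is not stated in the report, so the example simply counts every character in the string.

```python
# Rates as reported in the article (USD).
STT_BATCH_PER_HOUR = 0.10
STT_STREAM_PER_HOUR = 0.20
TTS_PER_MILLION_CHARS = 4.20


def stt_cost(audio_hours: float, streaming: bool = False) -> float:
    """Estimated STT cost for a given number of audio hours."""
    rate = STT_STREAM_PER_HOUR if streaming else STT_BATCH_PER_HOUR
    return audio_hours * rate


def tts_cost(num_chars: int) -> float:
    """Estimated TTS cost for a given number of input characters."""
    return num_chars / 1_000_000 * TTS_PER_MILLION_CHARS


# Example: 100 hours of streamed meetings and a 500,000-character audiobook.
print(f"streaming STT: ${stt_cost(100, streaming=True):.2f}")
print(f"batch STT:     ${stt_cost(100):.2f}")
print(f"TTS:           ${tts_cost(500_000):.2f}")

# Assumption: tags are counted like any other character.
line = "[whisper] Keep your voice down. [laugh]"
print(f"characters in tagged line: {len(line)}")
```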




Comment


SudoSatoshi

· 3h ago

Multilingual coverage of 25+ languages, but how is the quality for low-resource languages?
The average WER looks good, but long-tail languages might still be a disaster.


AirdropUnderTheNeonBridge

· 3h ago

Emotion and prosody inline tags? TTS is finally no longer just a script-reading machine; it can now add flair when doing audiobooks or game NPC dialogues.


AirdropCartographer

· 3h ago

Multi-channel plus speaker diarization makes this a ready-made tool for transcribing meeting recordings, but at a streaming cost of $0.20 per hour, long meetings still add up.


PerpPulse

· 3h ago

Grok Voice, Tesla's in-car system, and Starlink customer service all use the same audio stack; Musk's closed-loop ecosystem keeps getting slicker.


MintLaterMaybe

· 3h ago

What is inverse text normalization? Converting spoken numbers to Arabic numerals? This feature matters a lot for post-processing speech transcripts; it saves you from writing the regexes yourself.


CliffsideAncientPineAndRolling

· 3h ago

xAI's recent audio API one-two punch is quite aggressive: streaming STT at $0.20 per hour and TTS at $4.20 per million characters, a pricing strategy clearly aimed at large-scale commercial use.



xAI opens Grok STT and TTS audio APIs, with overall word error rate of STT reduced to 6.9%
