According to ME News, xAI has launched two standalone audio APIs, Grok STT and Grok TTS, built on the same audio stack that powers Grok Voice, Tesla in-car systems, and Starlink customer service, among others. STT offers REST batch transcription and WebSocket real-time streaming, with word-level timestamps, speaker diarization, multi-channel support, and inverse text normalization, covering more than 25 languages; TTS supports inline tags for emotion and prosody. xAI also published WER comparisons in which Grok leads in multiple scenarios; no third-party re-evaluation is available yet. Pricing: STT batch at $0.10 per hour, STT streaming at $0.20 per hour, and TTS at $4.20 per million characters.

MeNews

2026-05-26 13:41:03

ME News, April 18 (UTC+8): According to Beating Monitoring, xAI has launched two separate audio APIs: Grok Speech to Text and Grok Text to Speech. Both are built on the same audio stack that supports Grok Voice, Tesla’s in-car system, and Starlink customer service. This time the capabilities are released as standalone endpoints, so developers can build them directly into applications such as voice agents, real-time transcription, accessibility tools, and podcasts.

STT offers two modes. The REST API handles batch transcription of large audio files and, according to the report, returns results at millisecond-level speed; the WebSocket API is designed for real-time speech streams. Both include word-level timestamps, speaker diarization, per-channel recognition for multi-channel audio, and inverse text normalization (ITN), which automatically converts spoken numbers, dates, and currencies into standardized written form. The service covers more than 25 languages and supports seamless language switching mid-conversation.
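To make word-level timestamps and speaker diarization concrete, here is a minimal Python sketch that merges per-word results into speaker turns. The `Word` record and its fields (`text`, `start`, `end`, `speaker`) are hypothetical; the article does not describe the actual response schema of the Grok STT API, so treat this as an illustration of the idea rather than of xAI's format. The token `$42.50` stands in for what inverse text normalization might produce from "forty-two dollars and fifty cents".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """One recognized word (hypothetical shape, not xAI's schema)."""
    text: str
    start: float  # seconds from the start of the audio
    end: float
    speaker: str  # diarization label


@dataclass
class Turn:
    """A run of consecutive words from the same speaker."""
    speaker: str
    start: float
    end: float
    text: str


def group_into_turns(words: List[Word]) -> List[Turn]:
    """Merge consecutive words that share a speaker label into turns."""
    turns: List[Turn] = []
    for w in words:
        if turns and turns[-1].speaker == w.speaker:
            # Same speaker keeps talking: extend the current turn.
            turns[-1].end = w.end
            turns[-1].text += " " + w.text
        else:
            # Speaker changed (or first word): open a new turn.
            turns.append(Turn(w.speaker, w.start, w.end, w.text))
    return turns


# Made-up sample data for illustration only.
words = [
    Word("Invoice", 0.0, 0.4, "A"),
    Word("total", 0.4, 0.7, "A"),
    Word("is", 0.7, 0.8, "A"),
    Word("$42.50", 0.8, 1.5, "A"),
    Word("Thanks", 1.9, 2.3, "B"),
]
turns = group_into_turns(words)
for t in turns:
    print(f"[{t.start:.1f}-{t.end:.1f}] {t.speaker}: {t.text}")
```

Grouping by speaker label is the usual first step when turning a diarized transcript into meeting notes or captions.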

xAI also released a set of word error rate (WER) comparisons (lower is better). Overall, Grok scores 6.9%, ElevenLabs 9.0%, Deepgram 11.0%, and AssemblyAI 12.9%. The gap is wider for “telephone call entity recognition”: Grok scores 5.0%, against 12.0%, 13.5%, and 21.3% for the other three respectively. In typical business scenarios such as meetings, video podcasts, and phone calls, Grok also holds a slight lead. These figures come from xAI's own tests, and no third-party re-testing has been published yet.
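For readers unfamiliar with the metric, WER is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the system output, divided by the number of reference words. The sketch below implements that standard definition in Python; it is not xAI's evaluation code, and the article does not say how the texts were normalized before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("call me at five", "call me at nine"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.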

On pricing, STT batch costs $0.10 per hour of audio and STT streaming costs $0.20 per hour; TTS costs $4.20 per 1 million characters. TTS supports inline Speech Tags to control emotion and intonation, such as [laugh], [sigh], [whisper], and [...].
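As a rough illustration of these rates, the following Python sketch estimates costs for a given workload. It assumes simple linear proration; the article does not state billing granularity, minimum charges, or whether Speech Tags count toward billed characters, so the function names and behavior here are illustrative assumptions only.

```python
from decimal import Decimal

# Rates as quoted in the article (USD).
STT_BATCH_PER_HOUR = Decimal("0.10")
STT_STREAM_PER_HOUR = Decimal("0.20")
TTS_PER_MILLION_CHARS = Decimal("4.20")


def stt_cost(audio_hours, streaming: bool = False) -> Decimal:
    """Estimated STT cost, assuming linear per-hour proration."""
    rate = STT_STREAM_PER_HOUR if streaming else STT_BATCH_PER_HOUR
    return Decimal(str(audio_hours)) * rate


def tts_cost(characters: int) -> Decimal:
    """Estimated TTS cost, assuming linear per-character proration."""
    return Decimal(characters) * TTS_PER_MILLION_CHARS / Decimal(1_000_000)


print(stt_cost(100))                  # 100 hours of batch audio
print(stt_cost(100, streaming=True))  # 100 hours of streamed audio
print(tts_cost(250_000))              # 250,000 characters of speech
```

`Decimal` is used instead of floats so that small per-unit rates do not accumulate binary rounding error.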

(Source: BlockBeats)

Comment

InstantNoodle-LevelResearcher

· 5h ago

Batch transcription uses REST, real-time uses WebSocket; the architecture design is quite practical.

GateUser-f85bc167

· 5h ago

Just wait for a real benchmark score; checking out xAI's benchmark will do.

MarginMom

· 6h ago

Grok TTS supports emotional tags, which is pretty interesting. Will AI voiceovers be able to add crying tones in the future?

GateUser-f92ba9fa

· 6h ago

25+ language coverage, has anyone tested how the Chinese version performs?

Lightning-FastComposure

· 6h ago

What is inverse text normalization? Is it some cutting-edge technology? Can someone knowledgeable explain it in detail?

HaiyanColdWallet

· 6h ago

Word-level timestamps + speaker separation, meeting notes enthusiasts rejoice

QuantsAndCats

· 6h ago

Is a TTS costing $4.2 per million characters cheaper or more expensive than ElevenLabs?

AmberTeaSwirl

· 6h ago

Streaming STT $0.2/hour, should be able to run in real-time captioning scenarios

MultisigOnRocks

· 6h ago

Feeding the same audio stack to Grok Voice, Tesla, and Starlink, Musk’s ecosystem has formed a closed-loop.

BalanceScreenshotAfterTheRain

· 6h ago

xAI's audio API came a bit suddenly; is a STT pricing of $0.10/hour considered reasonable?
