ME News reports that xAI has launched two standalone audio APIs, Grok STT and Grok TTS. Both are built on the same audio stack that powers Grok Voice, Tesla in-car systems, and Starlink customer service, among others. STT offers REST batch transcription and WebSocket real-time streaming, with word-level timestamps, speaker diarization, multi-channel support, and inverse text normalization, and covers more than 25 languages; TTS supports inline tags for emotion and prosody. xAI also published WER comparisons in which Grok leads in multiple scenarios, but no third party has re-evaluated the figures yet. Pricing: batch STT at $0.10 per hour, streaming STT at $0.20 per hour, and TTS at $4.20 per million characters.

MeNews

2026-05-27 02:47:48


ME News, April 18 (UTC+8): According to Dongcha Beating monitoring, xAI has launched two standalone audio APIs, Grok Speech to Text and Grok Text to Speech. Both are built on the same audio stack that powers Grok Voice, Tesla’s in-car system, and Starlink customer service. They are now exposed as standalone endpoints, so developers can connect them directly to voice agents, real-time transcription, accessibility tools, and podcast-style applications.

STT offers two modes. The REST API handles batch transcription of large audio files, with millisecond-level responses; the WebSocket API handles real-time audio streams. Built-in capabilities include word-level timestamps, speaker diarization, separate recognition for each channel of multi-channel audio, and Inverse Text Normalization (ITN), which automatically converts spoken numbers, dates, and currency amounts into standardized written form (for example, "twenty dollars" becomes "$20"). The system covers 25+ languages and supports switching languages mid-conversation.
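To illustrate what word-level timestamps combined with speaker diarization make possible, here is a minimal Python sketch that merges consecutive words from the same speaker into timed turns. The `Word` and `Turn` structures and their field names are hypothetical, chosen only for illustration; the article does not describe the actual response schema of the Grok STT API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    # Hypothetical shape of one transcribed word; not the real API schema.
    text: str
    start: float  # seconds from the start of the audio
    end: float
    speaker: str  # diarization label, e.g. "A" or "B"


@dataclass
class Turn:
    speaker: str
    start: float
    end: float
    text: str


def group_into_turns(words: List[Word]) -> List[Turn]:
    """Merge consecutive words spoken by the same speaker into one turn."""
    turns: List[Turn] = []
    for w in words:
        if turns and turns[-1].speaker == w.speaker:
            # Same speaker keeps talking: extend the current turn.
            turns[-1].end = w.end
            turns[-1].text += " " + w.text
        else:
            # Speaker changed (or first word): open a new turn.
            turns.append(Turn(w.speaker, w.start, w.end, w.text))
    return turns


# Made-up sample data standing in for a diarized transcript.
words = [
    Word("Hello", 0.00, 0.40, "A"),
    Word("there", 0.45, 0.80, "A"),
    Word("Hi", 1.10, 1.30, "B"),
    Word("Thanks", 1.60, 2.00, "A"),
]
turns = group_into_turns(words)
for t in turns:
    print(f"[{t.start:.2f}-{t.end:.2f}] {t.speaker}: {t.text}")
```

A podcast editor could use turns like these to cut or caption audio per speaker; the grouping logic stays the same whatever field names the real response uses.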

xAI also published a set of word error rate comparisons (WER: word-level substitutions, deletions, and insertions divided by the number of reference words; lower is better). Overall, Grok scores 6.9%, against 9.0% for ElevenLabs, 11.0% for Deepgram, and 12.9% for AssemblyAI. The gap widens for entity recognition in phone calls: Grok scores 5.0%, compared with 12.0%, 13.5%, and 21.3% for the other three. In common business scenarios such as meetings, video podcasts, and phone calls, Grok also holds a slight lead. All of these figures come from xAI’s own testing and have not yet been verified by a third party.
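Since the figures have not been independently verified, readers who want to check them on their own audio would need to compute WER themselves. The sketch below uses the standard word-level edit distance definition; it assumes simple lowercasing and whitespace tokenization, which may differ from the text normalization xAI applied in its own tests.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
print(f"{word_error_rate('the cat sat on the mat', 'the cat sat on a mat'):.1%}")
```

Because WER is sensitive to how punctuation, casing, and numbers are normalized, scores from different pipelines are only comparable when the same normalization is applied to all systems.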

On pricing, batch STT costs $0.10 per hour and streaming STT costs $0.20 per hour. TTS costs $4.20 per 1 million characters. TTS also supports controlling emotion and prosody through inline Speech Tags such as [laugh], [sigh], and [whisper]. (Source: BlockBeats)
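The listed prices translate into a simple back-of-the-envelope estimate, sketched below in Python. The rates are the ones quoted above; the function names are illustrative, and the sketch assumes plain linear billing with no minimum charge or rounding increment. Whether Speech Tags count toward billed characters is not stated in the article.

```python
# Rates as quoted in the article (USD).
STT_BATCH_USD_PER_HOUR = 0.10
STT_STREAMING_USD_PER_HOUR = 0.20
TTS_USD_PER_MILLION_CHARS = 4.20


def stt_cost(audio_hours: float, streaming: bool = False) -> float:
    """Estimated STT cost, assuming linear per-hour billing (an assumption)."""
    rate = STT_STREAMING_USD_PER_HOUR if streaming else STT_BATCH_USD_PER_HOUR
    return audio_hours * rate


def tts_cost(characters: int) -> float:
    """Estimated TTS cost, assuming linear per-character billing (an assumption)."""
    return characters / 1_000_000 * TTS_USD_PER_MILLION_CHARS


print(f"100 h batch STT:     ${stt_cost(100):.2f}")
print(f"100 h streaming STT: ${stt_cost(100, streaming=True):.2f}")
print(f"250,000 chars TTS:   ${tts_cost(250_000):.2f}")
```

Under these assumptions, 100 hours of audio costs about $10 in batch mode and $20 in streaming mode, and 250,000 characters of synthesized speech costs about $1.05.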




Comment


GovernanceVotingTug-Of-WarKing

· 2h ago

Coverage of 25+ languages is decent, but quality for less common languages is an open question; you will only know by trying it.


ViewingBullAndBearMarketsFromA

· 2h ago

The latency of the WebSocket real-time stream, in milliseconds, is not specified, and that figure is critical for live-broadcast scenarios.


BorrowedHalo

· 2h ago

With emotion tags embedded inline, will AI podcasts be able to detect sarcasm in the future?


PuddingMarketMaker

· 2h ago

Starlink customer service is already running on it, so Musk’s ecosystem closed loop is effectively proven.


GateUser-83c80dd0

· 2h ago

Word-level timestamps + speaker separation, podcast editing enthusiasts rejoice


TideEarningsTable

· 2h ago

$4.20 per million characters: is that cheaper or more expensive than ElevenLabs? Has anyone done the math?



xAI opens Grok STT and TTS audio APIs, with overall word error rate of STT reduced to 6.9%
