Like the sound of Tesla's voice? xAI officially releases Grok voice APIs, with TTS priced at $4.20 per million characters and recognition accuracy surpassing ElevenLabs
xAI officially launched standalone Grok speech-to-text (STT) and text-to-speech (TTS) APIs this week. The same technology stack already runs in Grok Voice, Tesla vehicles, and Starlink customer service systems. STT is priced at $0.10 per hour for batch transcription and $0.20 per hour for streaming, and supports over 25 languages.
(Background: Grok 4.3 beta opens to Heavy subscription users! Elon Musk: the flagship version training completed after 5 days)
(Additional background: Google launches Gemini 3.1 Flash TTS: audio tags make AI voiceovers more lively, supporting 70+ languages, free trial on Google AI Studio)
The same voice technology that makes Tesla vehicles speak and Starlink customer service respond to users is now available via API. On the 17th, xAI officially announced the launch of independent Grok speech-to-text (STT) and text-to-speech (TTS) APIs, allowing external developers to directly access this speech infrastructure already in use within xAI’s products.
STT: word-level timestamps + speaker diarization, only $0.10 per hour for batch transcription
According to official details, the Grok STT API offers two access modes: batch processing via REST API and low-latency real-time streaming via WebSocket API. Pricing is $0.10 per hour for batch and $0.20 per hour for streaming, with the company claiming a significant price advantage over mainstream competitors like ElevenLabs and Deepgram.
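To put the two price points in perspective, here is a minimal cost sketch in Python. The per-hour prices come from the article; the function name and the assumption that billing scales linearly with audio duration (no minimum charge, no rounding to whole hours) are illustrative and not confirmed by xAI.

```python
# Per-hour STT prices quoted in the article (USD).
STT_PRICE_PER_HOUR = {"batch": 0.10, "streaming": 0.20}


def stt_cost_usd(audio_seconds: float, mode: str = "batch") -> float:
    """Estimate transcription cost, assuming linear per-second proration.

    The proration rule is an assumption; actual billing granularity
    is not stated in the article.
    """
    if mode not in STT_PRICE_PER_HOUR:
        raise ValueError(f"unknown mode: {mode!r}")
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return audio_seconds / 3600 * STT_PRICE_PER_HOUR[mode]


# Example workload: 1,000 recorded calls of 6 minutes each (100 hours).
total_seconds = 1000 * 6 * 60
print(f"batch:     ${stt_cost_usd(total_seconds, 'batch'):.2f}")
print(f"streaming: ${stt_cost_usd(total_seconds, 'streaming'):.2f}")
```

Under these assumptions, streaming the same audio costs exactly twice as much as transcribing it in batch.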
Functionally, Grok STT supports over 25 languages and provides word-level timestamps, speaker diarization, multi-channel audio, and intelligent inverse text normalization (converting spoken forms such as "twenty dollars" into written forms such as "$20"). It is aimed at enterprise scenarios that require high accuracy, such as meeting transcription, legal and medical records, and customer call logs.
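As an illustration of what word-level timestamps plus speaker diarization make possible, the sketch below groups consecutive words by speaker into readable turns. The `Word` record and its field names are hypothetical and chosen for this example only; the article does not describe the actual response schema of the Grok STT API.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Word:
    """Hypothetical per-word record; not the real Grok STT schema."""
    text: str
    start: float   # start time in seconds
    end: float     # end time in seconds
    speaker: str   # diarization label


def speaker_turns(words):
    """Merge consecutive words from the same speaker into one turn."""
    turns = []
    for speaker, group in groupby(words, key=lambda w: w.speaker):
        chunk = list(group)
        text = " ".join(w.text for w in chunk)
        turns.append((speaker, chunk[0].start, chunk[-1].end, text))
    return turns


words = [
    Word("Hello", 0.00, 0.40, "S1"),
    Word("there", 0.45, 0.80, "S1"),
    Word("Hi", 1.10, 1.30, "S2"),
    Word("Thanks", 1.60, 2.00, "S1"),
]

for speaker, start, end, text in speaker_turns(words):
    print(f"[{start:.2f}-{end:.2f}] {speaker}: {text}")
```

A meeting or call transcript with speaker labels is essentially this kind of grouping applied to the full word list.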
Grok STT also shows an advantage in entity recognition benchmarks. When identifying key entities such as names, account numbers, and dates in phone calls, Grok STT's error rate is 5.0%, compared with 12.0% for ElevenLabs, 13.5% for Deepgram, and 21.3% for AssemblyAI.
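The gap between these figures is easier to judge in relative terms. The following sketch uses only the error rates quoted above; the helper names are illustrative, and the numbers are vendor-reported results rather than independent measurements.

```python
# Entity error rates (%) as quoted in the article.
ERROR_RATES = {
    "Grok STT": 5.0,
    "ElevenLabs": 12.0,
    "Deepgram": 13.5,
    "AssemblyAI": 21.3,
}
BASELINE = ERROR_RATES["Grok STT"]


def error_ratio(name: str) -> float:
    """How many times more entity errors `name` makes than Grok STT."""
    return ERROR_RATES[name] / BASELINE


def relative_reduction(name: str) -> float:
    """Fraction of `name`'s errors that Grok STT avoids."""
    return (ERROR_RATES[name] - BASELINE) / ERROR_RATES[name]


for name in ERROR_RATES:
    if name == "Grok STT":
        continue
    print(
        f"{name}: {error_ratio(name):.1f}x the errors of Grok STT "
        f"({relative_reduction(name):.0%} fewer errors with Grok STT)"
    )
```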
TTS: 5 voice personalities + speech tags, $4.20 per million characters
Grok TTS API offers five distinct voice styles: Ara (female, warm and friendly), Eve (female, lively and positive), Leo (male, authoritative and powerful), Rex (male, confident and clear), Sal (neutral, smooth and balanced).
The API automatically detects input language, natively supports over 20 languages, and uses BCP-47 language codes to control pronunciation.
Audio output formats include MP3, WAV, PCM (Linear16), G.711 μ-law, and G.711 A-law. The latter two are common telephony codecs, which suggests xAI is positioning the API for telecom integration.
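For readers unfamiliar with the telephony codecs, the sketch below shows the continuous μ-law companding curve that G.711 μ-law is built on, along with the standard G.711 bitrate (8,000 samples per second at 8 bits each). This is background illustration only: G.711 itself specifies a piecewise-linear 8-bit approximation of this curve, so the code is not a bit-exact encoder, and it has nothing to do with how xAI implements its output formats.

```python
import math

MU = 255  # companding parameter used by mu-law telephony


def mulaw_compress(x: float) -> float:
    """Continuous mu-law compression of a sample in [-1.0, 1.0]."""
    if not -1.0 <= x <= 1.0:
        raise ValueError("sample must be in [-1.0, 1.0]")
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mulaw_expand(y: float) -> float:
    """Inverse of mulaw_compress."""
    if not -1.0 <= y <= 1.0:
        raise ValueError("value must be in [-1.0, 1.0]")
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


# G.711 carries 8,000 samples per second at 8 bits per sample.
G711_BITRATE = 8000 * 8            # bits per second
LINEAR16_8KHZ_BITRATE = 8000 * 16  # 16-bit PCM at the same sample rate

for sample in (0.01, 0.1, 0.5, 1.0):
    print(f"{sample:>5} -> {mulaw_compress(sample):.3f}")
print(f"G.711: {G711_BITRATE // 1000} kbit/s, "
      f"Linear16 at 8 kHz: {LINEAR16_8KHZ_BITRATE // 1000} kbit/s")
```

The curve gives quiet samples more resolution than loud ones, which is why 8 bits per sample are enough for intelligible speech on phone lines.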
A key feature of the TTS API is “speech tags,” allowing developers to embed commands within text to finely control pauses, laughter, whispers, intonation emphasis, speech rate, and pitch, making synthesized speech more natural and human-like. Pricing is $4.20 per million characters.
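A rough way to budget for the TTS price is to count characters before sending text for synthesis. The sketch below uses the $4.20 per million characters figure from the article; the function name is illustrative, and it assumes every character in the input (including whitespace and any embedded speech tags) is billed, which the article does not specify.

```python
TTS_PRICE_PER_MILLION_CHARS = 4.20  # USD, as quoted in the article


def tts_cost_usd(text: str) -> float:
    """Estimate synthesis cost, assuming every input character is billed.

    Whether whitespace or embedded speech tags count toward billing
    is an assumption here; the article does not say.
    """
    return len(text) / 1_000_000 * TTS_PRICE_PER_MILLION_CHARS


# Example: a script of roughly 300,000 characters.
script = "a" * 300_000
print(f"{len(script):,} characters -> ${tts_cost_usd(script):.2f}")
```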
The same technology stack powers Tesla and Starlink
xAI emphasizes that these two APIs are not entirely new developments but are based on the same infrastructure already deployed in Grok Voice, Tesla vehicle voice interactions, and Starlink customer support systems.
This infrastructure first appeared at the end of 2025 as the Grok Voice Agent API, which provides real-time voice dialogue capabilities. It ranked first in the Big Bench Audio benchmark, with a time to first audio response of under 1 second, about five times faster than its closest competitors.
The release of these standalone STT and TTS endpoints effectively splits this integrated voice pipeline into individual components, allowing developers to assemble them as needed.
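The idea of assembling separate components can be sketched as a simple composition of three stages: transcription, response generation, and synthesis. Everything in the block below is hypothetical: the type aliases, the function names, and the stub stages stand in for real STT, language model, and TTS calls, and none of it reflects the actual Grok API surface.

```python
from typing import Callable

# Hypothetical stage signatures for illustration only.
Transcribe = Callable[[bytes], str]   # audio in  -> text
Respond = Callable[[str], str]        # text in   -> reply text
Synthesize = Callable[[str], bytes]   # reply text -> audio out


def make_voice_pipeline(
    transcribe: Transcribe,
    respond: Respond,
    synthesize: Synthesize,
) -> Callable[[bytes], bytes]:
    """Chain independent STT, response, and TTS stages into one call."""
    def run(audio_in: bytes) -> bytes:
        text_in = transcribe(audio_in)
        text_out = respond(text_in)
        return synthesize(text_out)
    return run


# Stub stages so the sketch runs without any network access.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_respond(text: str) -> str:
    return f"You said: {text}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = make_voice_pipeline(fake_transcribe, fake_respond, fake_synthesize)
print(pipeline(b"hello"))
```

Because each stage is just a callable, a developer who only needs transcription or only needs synthesis can use that stage alone, or swap in a different provider for any one step.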