Like the voice in a Tesla? xAI officially releases the Grok voice APIs, with TTS priced at $4.20 per million characters and entity recognition accuracy ahead of ElevenLabs.

xAI officially launched standalone Grok speech-to-text (STT) and text-to-speech (TTS) APIs this week. The same technology stack already runs in Grok Voice, Tesla vehicles, and Starlink customer service systems. STT is priced at $0.10 per hour for batch transcription and $0.20 per hour for streaming, and supports more than 25 languages.

Table of Contents

STT: word-level timestamps + speaker diarization, only $0.10 per hour for batch transcription
TTS: 5 voice personalities + speech tags, $4.20 per million characters
The same technology stack powers Tesla and Starlink

The same voice technology that makes Tesla vehicles speak and Starlink customer service respond to users is now available via API. On the 17th, xAI officially announced the launch of independent Grok speech-to-text (STT) and text-to-speech (TTS) APIs, allowing external developers to directly access this speech infrastructure already in use within xAI’s products.

STT: word-level timestamps + speaker diarization, only $0.10 per hour for batch transcription

According to official details, the Grok STT API offers two access modes: batch processing via REST API and low-latency real-time streaming via WebSocket API. Pricing is $0.10 per hour for batch and $0.20 per hour for streaming, with the company claiming a significant price advantage over mainstream competitors like ElevenLabs and Deepgram.
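To make the two price points concrete, the following is a minimal sketch of a cost estimator. Only the two hourly rates come from the article; the function name `stt_cost_usd` and the assumption that usage is prorated linearly by the second are illustrative and not confirmed by xAI.

```python
from decimal import Decimal

# Hourly rates quoted in the article (USD per hour of audio).
STT_RATE_PER_HOUR = {
    "batch": Decimal("0.10"),
    "streaming": Decimal("0.20"),
}


def stt_cost_usd(audio_seconds: float, mode: str = "batch") -> Decimal:
    """Estimate STT cost for a given audio duration.

    Assumption: billing is prorated linearly per second. The article only
    states hourly rates, so actual rounding rules may differ.
    """
    if mode not in STT_RATE_PER_HOUR:
        raise ValueError(f"unknown mode: {mode!r}")
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    hours = Decimal(str(audio_seconds)) / Decimal(3600)
    return (STT_RATE_PER_HOUR[mode] * hours).quantize(Decimal("0.0001"))


if __name__ == "__main__":
    # 500 hours of recorded calls, transcribed in batch vs. streamed live.
    seconds = 500 * 3600
    print("batch:    ", stt_cost_usd(seconds, "batch"))
    print("streaming:", stt_cost_usd(seconds, "streaming"))
```

Under these assumptions, streaming the same audio costs exactly twice as much as batch transcription, so the choice mainly comes down to whether the application needs results in real time.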

In terms of features, Grok STT supports more than 25 languages and provides word-level timestamps, speaker diarization, multi-channel audio, and intelligent inverse text normalization (converting spoken forms such as "twenty five dollars" into written forms such as "$25"). It is aimed at enterprise scenarios that require high accuracy, such as meeting transcription, legal and medical records, and customer call logs.
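As an illustration of how word-level timestamps and speaker diarization combine in practice, here is a minimal sketch that folds a stream of timed, speaker-labeled words into speaker turns. The `Word` and `Turn` types and their field names are hypothetical; the article does not describe the actual response schema of the Grok STT API.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Word:
    """One transcribed word (hypothetical shape, not the real API schema)."""
    text: str
    start: float   # seconds from the start of the audio
    end: float
    speaker: str   # diarization label, e.g. "A" or "B"


@dataclass
class Turn:
    """A run of consecutive words spoken by the same speaker."""
    speaker: str
    start: float
    end: float
    text: str


def words_to_turns(words: Iterable[Word]) -> list[Turn]:
    """Merge consecutive words with the same speaker label into turns."""
    turns: list[Turn] = []
    for w in words:
        if turns and turns[-1].speaker == w.speaker:
            last = turns[-1]
            last.end = w.end            # extend the current turn
            last.text += " " + w.text
        else:
            turns.append(Turn(w.speaker, w.start, w.end, w.text))
    return turns


if __name__ == "__main__":
    sample = [
        Word("Hello", 0.0, 0.4, "A"),
        Word("there", 0.4, 0.8, "A"),
        Word("Hi", 1.0, 1.2, "B"),
    ]
    for t in words_to_turns(sample):
        print(f"[{t.start:.1f}-{t.end:.1f}] {t.speaker}: {t.text}")
```

A meeting or call-log transcript is essentially this turn list, which is why both features matter for the enterprise scenarios mentioned above.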

Grok STT also leads in entity recognition benchmarks. When identifying key entities such as names, account numbers, and dates in phone calls, Grok STT has an error rate of 5.0%, compared with 12.0% for ElevenLabs, 13.5% for Deepgram, and 21.3% for AssemblyAI.
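The gap between these figures is easier to judge as a relative reduction in errors. The short sketch below computes it from the four error rates quoted above; the helper name `relative_error_reduction` is illustrative, and the numbers are taken at face value from the article rather than independently verified.

```python
# Entity error rates (percent) as quoted in the article.
ENTITY_ERROR_RATE_PCT = {
    "Grok STT": 5.0,
    "ElevenLabs": 12.0,
    "Deepgram": 13.5,
    "AssemblyAI": 21.3,
}


def relative_error_reduction(baseline_pct: float, candidate_pct: float) -> float:
    """Fraction of the baseline's errors that the candidate avoids."""
    if baseline_pct <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline_pct - candidate_pct) / baseline_pct


if __name__ == "__main__":
    grok = ENTITY_ERROR_RATE_PCT["Grok STT"]
    for name, rate in ENTITY_ERROR_RATE_PCT.items():
        if name != "Grok STT":
            reduction = relative_error_reduction(rate, grok)
            print(f"vs {name}: {reduction:.1%} fewer entity errors")
```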

TTS: 5 voice personalities + speech tags, $4.20 per million characters

Grok TTS API offers five distinct voice styles: Ara (female, warm and friendly), Eve (female, lively and positive), Leo (male, authoritative and powerful), Rex (male, confident and clear), Sal (neutral, smooth and balanced).

The API automatically detects input language, natively supports over 20 languages, and uses BCP-47 language codes to control pronunciation.
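BCP-47 tags are strings such as `en`, `pt-BR`, or `zh-Hant-TW`. As a rough sketch of what preparing such a code might look like on the client side, the helper below applies the conventional casing (lowercase language, title-case script, uppercase region). The function name `normalize_language_tag` is hypothetical, accepting underscores is a convenience outside the standard, and the simplification ignores extension and private-use subtags. Which tags the Grok TTS API actually accepts is not stated in the article.

```python
def normalize_language_tag(tag: str) -> str:
    """Apply conventional BCP-47 casing to a simple language tag.

    Simplified: handles language, script, and region subtags only.
    """
    parts = tag.strip().replace("_", "-").split("-")
    language = parts[0]
    if not (language.isascii() and language.isalpha() and 2 <= len(language) <= 8):
        raise ValueError(f"invalid primary language subtag: {language!r}")

    out = [language.lower()]
    for sub in parts[1:]:
        if not (sub.isascii() and sub.isalnum()):
            raise ValueError(f"invalid subtag: {sub!r}")
        if len(sub) == 4 and sub.isalpha():
            out.append(sub.title())    # script, e.g. "Hant"
        elif len(sub) == 2 and sub.isalpha():
            out.append(sub.upper())    # region, e.g. "BR"
        else:
            out.append(sub.lower())    # numeric regions, variants, etc.
    return "-".join(out)


if __name__ == "__main__":
    for raw in ["PT-br", "zh-hant-tw", "es-419", "EN"]:
        print(raw, "->", normalize_language_tag(raw))
```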

Audio output formats include MP3, WAV, PCM (Linear16), G.711 μ-law, and G.711 A-law. The last two are common telephony codecs, which suggests xAI is positioning the API for telecom integration.
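For readers unfamiliar with the telephony formats, the idea behind μ-law is logarithmic companding: quiet samples get more resolution than loud ones before quantization to 8 bits. The sketch below implements only the continuous μ-law curve with μ = 255, the value G.711 uses. It is not a bit-exact G.711 encoder (the standard uses a piecewise-linear 8-bit approximation), and it says nothing about how xAI produces its output.

```python
import math

MU = 255  # companding parameter used by G.711 mu-law


def mu_law_compress(x: float, mu: int = MU) -> float:
    """Map a linear sample in [-1, 1] through the continuous mu-law curve."""
    if not -1.0 <= x <= 1.0:
        raise ValueError("sample must be in [-1, 1]")
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y: float, mu: int = MU) -> float:
    """Inverse of mu_law_compress: recover the linear sample."""
    if not -1.0 <= y <= 1.0:
        raise ValueError("value must be in [-1, 1]")
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


if __name__ == "__main__":
    # Quiet samples are boosted far more than loud ones.
    for sample in (0.01, 0.1, 0.5, 1.0):
        print(f"{sample:>5} -> {mu_law_compress(sample):.4f}")
```

A-law follows the same principle with a different curve, and is the variant typically used on European telephone networks, while μ-law is used in North America and Japan.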

A key feature of the TTS API is “speech tags,” allowing developers to embed commands within text to finely control pauses, laughter, whispers, intonation emphasis, speech rate, and pitch, making synthesized speech more natural and human-like. Pricing is $4.20 per million characters.
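At $4.20 per million characters, the cost of synthesizing a given text is a simple proportion. The sketch below assumes that every character of the input string is billed, including any embedded speech tags; the article does not say how tags are counted, and the function names are illustrative.

```python
from decimal import Decimal

# Price quoted in the article (USD per one million input characters).
TTS_USD_PER_MILLION_CHARS = Decimal("4.20")


def chars_cost_usd(n_chars: int) -> Decimal:
    """Cost of synthesizing n_chars characters at the quoted rate."""
    if n_chars < 0:
        raise ValueError("n_chars must be non-negative")
    return TTS_USD_PER_MILLION_CHARS * Decimal(n_chars) / Decimal(1_000_000)


def tts_cost_usd(text: str) -> Decimal:
    """Cost of synthesizing `text`.

    Assumption: every character in the string is billed, tags included.
    """
    return chars_cost_usd(len(text))


if __name__ == "__main__":
    # A 90,000-word audiobook at roughly 6 characters per word (with spaces).
    print(chars_cost_usd(90_000 * 6))
```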

The same technology stack powers Tesla and Starlink

xAI emphasizes that these two APIs are not entirely new developments but are based on the same infrastructure already deployed in Grok Voice, Tesla vehicle voice interactions, and Starlink customer support systems.

This infrastructure first appeared at the end of 2025 as the Grok Voice Agent API, which provides real-time voice dialogue. It ranked first on the Big Bench Audio benchmark, with a time to first audio response of under 1 second, about five times faster than competitors.

The release of these standalone STT and TTS endpoints effectively splits this integrated voice pipeline into individual components, allowing developers to assemble them as needed.
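The "assemble as needed" idea can be sketched as three pluggable stages. Everything in the block below is hypothetical: the `VoicePipeline` class, the stage signatures, and the stub functions are illustrative stand-ins, and no real xAI endpoint or client library is called.

```python
from dataclasses import dataclass
from typing import Callable

# Stage signatures (illustrative, not the real API).
Transcriber = Callable[[bytes], str]   # audio in  -> text
Responder = Callable[[str], str]       # text in   -> reply text
Synthesizer = Callable[[str], bytes]   # reply text -> audio out


@dataclass(frozen=True)
class VoicePipeline:
    """Chain independent STT, response, and TTS stages."""
    transcribe: Transcriber
    respond: Responder
    synthesize: Synthesizer

    def run(self, audio_in: bytes) -> bytes:
        text_in = self.transcribe(audio_in)
        text_out = self.respond(text_in)
        return self.synthesize(text_out)


if __name__ == "__main__":
    # Stubs standing in for real STT, language model, and TTS calls.
    pipeline = VoicePipeline(
        transcribe=lambda audio: audio.decode("utf-8"),
        respond=lambda text: f"You said: {text}",
        synthesize=lambda text: text.encode("utf-8"),
    )
    print(pipeline.run(b"hello"))
```

Because each stage is just a function, a developer could use only the STT stage for transcription, only the TTS stage for narration, or swap the middle stage for any text-generation backend.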
