xAI launches Grok speech-to-text and text-to-speech APIs

ME News Report, April 18 (UTC+8): xAI has announced the official launch of two standalone audio APIs, Grok Speech-to-Text (STT) and Grok Text-to-Speech (TTS).

Grok STT offers high-accuracy, low-latency transcription, with batch processing through a REST API and real-time streaming through a WebSocket API. Features include word-level timestamps, speaker separation (diarization), multi-channel support, and intelligent inverse text normalization, which renders spoken forms in written form (for example, "twenty five dollars" as "$25"). According to the source article, in benchmarks covering domains such as phone calls, meetings, and videos/podcasts, its word error rate is lower than that of mainstream commercial models from ElevenLabs, Deepgram, and AssemblyAI. The service supports more than 25 languages and is priced at $0.10 per hour for batch processing and $0.20 per hour for streaming.

Grok TTS generates fast, natural, and expressive speech, with fine-grained control through simple voice tags, and is priced at $4.20 per 1 million characters. Both APIs are built on the same technology stack that powers Grok Voice, Tesla vehicles, and Starlink customer support. (Source: InfoQ)
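To put the quoted prices in perspective, here is a minimal cost sketch in Python. It uses only the three rates reported above and assumes strictly linear billing with no minimum charge, rounding, or volume discount. The report does not describe billing granularity, so that is an assumption. The function names `stt_cost` and `tts_cost` are hypothetical helpers for illustration and are not part of any xAI SDK.

```python
from decimal import Decimal

# Rates as quoted in the report (USD).
STT_BATCH_PER_HOUR = Decimal("0.10")
STT_STREAMING_PER_HOUR = Decimal("0.20")
TTS_PER_MILLION_CHARS = Decimal("4.20")


def stt_cost(audio_hours, streaming=False):
    """Estimated transcription cost for a number of audio hours.

    Assumes linear billing (no minimums or rounding), which the
    report does not confirm.
    """
    rate = STT_STREAMING_PER_HOUR if streaming else STT_BATCH_PER_HOUR
    # Go through str() so float inputs such as 1.5 convert exactly.
    return Decimal(str(audio_hours)) * rate


def tts_cost(characters):
    """Estimated synthesis cost for a number of input characters."""
    return Decimal(characters) * TTS_PER_MILLION_CHARS / Decimal(1_000_000)


if __name__ == "__main__":
    print("500 h batch STT:     $", stt_cost(500))
    print("500 h streaming STT: $", stt_cost(500, streaming=True))
    print("2.5M characters TTS: $", tts_cost(2_500_000))
```

Under these assumptions, 500 hours of batch transcription comes to $50, the same audio streamed comes to $100, and synthesizing 2.5 million characters comes to $10.50.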
Comment
RiskParachute
· 7h ago
Wait, can TTS control fine-grained details through tags?
Can it also adjust emotional tone?
BitByBitBenny
· 8h ago
Word-level timestamps plus speaker separation: a perfect tool for meeting minutes. I want to give it a try.
FrictionlessFred
· 8h ago
Grok Voice, Tesla, and Starlink share the same tech stack; Elon Musk has closed the loop on this ecosystem.
GoldfishUnderTheIce
· 8h ago
What is reverse text normalization? Some cutting-edge technology that converts spoken language back into standard written text?
Don'tMessWithSlippage.
· 8h ago
With more than 25 languages covered, has anyone tested how well it performs on Chinese?
YieldBonsai
· 8h ago
$4.20 per million characters. That number is intentional, right?
IOnlyTrustOn-ChainData.
· 8h ago
xAI's new audio API pricing is quite aggressive: $0.10 per hour for batch processing. It looks like it's going to wipe out a lot of ASR vendors.