Qwen3-TTS-12Hz-1.7B-VoiceDesign
Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign
A popular open text-to-speech model with 657.8K downloads a month. gigarouter is benchmarking it and onboarding it as a hosted, OpenAI-compatible API.
About This Model
Qwen3-TTS is a speech generation model that supports voice cloning, voice design, ultra-high-quality human-like speech generation, and natural language-based voice control. It is developed by Qwen and covers 10 languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian) as well as multiple dialectal voice profiles.
Architecture and Key Strengths
- Powered by a self-developed Qwen3-TTS-Tokenizer-12Hz for efficient acoustic compression and high-dimensional semantic modeling.
- Universal end-to-end architecture using a discrete multi-codebook language model to bypass traditional information bottlenecks.
- Extremely low-latency streaming generation, with end-to-end synthesis latency as low as 97 ms.
- Intelligent voice control via natural language instructions for flexible manipulation of timbre, emotion, and prosody.
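To give a feel for what a low-frame-rate, multi-codebook tokenizer implies for sequence length, here is a rough back-of-the-envelope sketch. It assumes the "12Hz" in the tokenizer name is a nominal frame rate of 12 frames per second. The number of codebooks is not stated here, so `num_codebooks` is a hypothetical parameter, and `codec_tokens` is an illustrative helper rather than part of any Qwen library.

```python
import math


def codec_tokens(duration_s: float,
                 frame_rate_hz: float = 12.0,  # assumption: nominal rate taken from the "12Hz" name
                 num_codebooks: int = 1) -> int:  # hypothetical: codebook count is not given in this card
    """Estimate how many discrete tokens represent `duration_s` seconds of audio."""
    # One frame is emitted per tokenizer step; a partial final frame is rounded up.
    frames = math.ceil(duration_s * frame_rate_hz)
    # A multi-codebook model emits one token per codebook for every frame.
    return frames * num_codebooks


frames_10s = codec_tokens(10.0)                    # frames for 10 s of audio
tokens_10s = codec_tokens(10.0, num_codebooks=4)   # same clip with 4 (illustrative) codebooks
print(frames_10s, tokens_10s)
```

Under these assumptions, ten seconds of speech is only 120 frames, which is one reason a low frame rate helps keep streaming latency down.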
Performance
Zero-shot speech generation results on the Seed-TTS test set, reported as word error rate (WER, lower is better). The row shown is for the Base variant, Qwen3-TTS-12Hz-1.7B-Base, not the VoiceDesign variant:
| Model | test-zh | test-en |
|---|---|---|
| Qwen3-TTS-12Hz-1.7B-Base | 0.77 | 1.24 |
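For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The sketch below is a minimal textbook implementation for illustration only. It is not the Seed-TTS scoring pipeline, which also involves transcribing the generated audio with a speech recognizer and normalizing text. `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"{wer:.4f}")
```

The function returns a fraction; benchmarks commonly multiply it by 100 and report a percentage.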
We're benchmarking and onboarding Qwen3-TTS-12Hz-1.7B-VoiceDesign as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
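As a rough idea of what a call might look like once the model is available, the sketch below builds (but does not send) a request in the shape of the OpenAI speech endpoint, `POST /audio/speech`. Everything specific here is an assumption: `BASE_URL` is a placeholder, the model ID is taken from this page, and passing the voice description through the `instructions` field is a guess at how voice design could be exposed. OpenAI's own endpoint also requires a `voice` field; whether the hosted API will need one is not known.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com/v1"  # hypothetical placeholder, not a real endpoint
MODEL_ID = "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign"


def build_speech_request(text: str, voice_description: str, api_key: str) -> urllib.request.Request:
    """Build an OpenAI-style text-to-speech request without sending it."""
    payload = {
        "model": MODEL_ID,
        "input": text,
        # Assumption: the natural-language voice description goes in `instructions`.
        "instructions": voice_description,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_speech_request(
    text="Hello, and welcome.",
    voice_description="A calm, warm voice speaking slowly.",
    api_key="YOUR_API_KEY",
)
print(req.get_method(), req.full_url)

# To actually send it against a real endpoint:
# with urllib.request.urlopen(req) as resp:
#     audio_bytes = resp.read()
```

Check the hosted API's documentation for the actual field names and supported audio formats before relying on this shape.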