Qwen3-TTS-12Hz-1.7B-VoiceDesign

Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign

A popular open text-to-speech model, with 657.8K downloads a month. gigarouter is benchmarking it and plans to host it as an OpenAI-compatible API.

status

coming soon

API providers

downloads / mo

657.8K

license

apache-2.0

about this model

Qwen3-TTS is a speech generation model developed by Qwen. It supports voice cloning, voice design, high-quality human-like speech generation, and voice control through natural-language instructions. It covers 10 languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian) as well as multiple dialectal voice profiles.

Architecture and Key Strengths

Powered by a self-developed Qwen3-TTS-Tokenizer-12Hz for efficient acoustic compression and high-dimensional semantic modeling.
Universal end-to-end architecture using a discrete multi-codebook language model to bypass traditional information bottlenecks.
Extremely low-latency streaming generation, with end-to-end synthesis latency as low as 97 ms.
Intelligent voice control via natural language instructions for flexible manipulation of timbre, emotion, and prosody.
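
To give a feel for what a low-frame-rate tokenizer means for sequence length, here is a minimal Python sketch. It assumes the "12Hz" in the tokenizer name means roughly 12 frames per second of audio, which this page does not state explicitly. The `num_codebooks` parameter is hypothetical: the page says the model is multi-codebook but does not give the count.

```python
import math


def codec_frames(duration_s: float, frame_rate_hz: float = 12.0) -> int:
    """Frames needed to cover `duration_s` seconds of audio.

    The default of 12 frames per second is an assumption read off the
    tokenizer name, not a documented value.
    """
    if duration_s < 0 or frame_rate_hz <= 0:
        raise ValueError("duration must be >= 0 and frame rate must be > 0")
    return math.ceil(duration_s * frame_rate_hz)


def codec_tokens(duration_s: float, num_codebooks: int,
                 frame_rate_hz: float = 12.0) -> int:
    """Total discrete tokens if every frame carries one token per codebook.

    `num_codebooks` is a hypothetical parameter for illustration only.
    """
    if num_codebooks < 1:
        raise ValueError("num_codebooks must be >= 1")
    return codec_frames(duration_s, frame_rate_hz) * num_codebooks
```

Under these assumptions, ten seconds of speech maps to 120 frames, so the language model works over a short sequence even for long utterances.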

Performance

Zero-shot speech generation on the Seed-TTS test set yields the following word error rates (WER, lower is better):

Model	test-zh	test-en
Qwen3-TTS-12Hz-1.7B-Base	0.77	1.24
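
For readers unfamiliar with the metric, the sketch below shows one common way to compute WER: the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. It is an illustration, not the Seed-TTS evaluation pipeline. Whitespace tokenization is an assumption and would not suit unsegmented Chinese text.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

In a TTS evaluation, the hypothesis would typically come from running a speech recognizer on the synthesized audio, so a lower WER indicates more intelligible speech.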

Additional resources: Blog | Paper | GitHub

not yet live

We're benchmarking and onboarding Qwen3-TTS-12Hz-1.7B-VoiceDesign as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
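
Since the model is not yet live, the following is only a sketch of what a request might look like, assuming the hosted endpoint follows the shape of OpenAI's `/audio/speech` route. The base URL, the `"default"` voice name, and support for the `instructions` field are all assumptions. The function only builds the request and does not send it.

```python
import json
import urllib.request

MODEL_ID = "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign"


def build_speech_request(base_url: str, api_key: str, text: str,
                         voice_instructions: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style text-to-speech request."""
    payload = {
        "model": MODEL_ID,
        "input": text,
        # Hypothetical voice name; the real service may expect another value.
        "voice": "default",
        # Assumption: the natural-language voice description is passed here.
        "instructions": voice_instructions,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Once the endpoint is available, passing the returned object to `urllib.request.urlopen` would send it. With an OpenAI-style speech route the response body is the audio itself, though the exact format served here is not yet documented.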