Qwen3-TTS-12Hz-1.7B-CustomVoice
Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
A popular open text-to-speech model, with 2M downloads a month. gigarouter benchmarks and hosts it as an OpenAI-compatible API.
About This Model
Qwen3-TTS-12Hz-1.7B-CustomVoice is a text-to-speech (TTS) model that lets users control the speaking style of a chosen timbre through natural-language instructions. It ships with 9 premium timbres covering combinations of gender, age, language, and dialect, and supports 10 languages.
The model uses a discrete multi-codebook language model architecture for end-to-end speech modeling, which avoids the bottlenecks and cascading errors of traditional pipelines that chain a language model with a diffusion transformer (LM+DiT). It is built on Qwen3-TTS-Tokenizer-12Hz, which performs efficient acoustic compression and high-dimensional semantic modeling while preserving paralinguistic and acoustic environmental features.
Key Capabilities
- Multilingual support: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian, including multiple dialectal voice profiles.
- Low-latency streaming: the model can emit the first audio packet after receiving a single input character, with end-to-end synthesis latency as low as 97 ms.
- Instruction-driven voice control: natural language instructions enable adaptive adjustment of timbre, emotion, tone, and speaking rate.
- Contextual understanding: adapts emotional expression and prosody based on text semantics and instructions.
- Robustness to noisy input text.
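One way to check the low-latency streaming claim in practice is to time the arrival of the first audio chunk. The following is a minimal sketch, assuming the audio arrives as a byte stream exposed through any file-like object with a `read` method. The helper name `consume_stream`, the chunk size, and the in-memory stand-in for a response body are all illustrative assumptions, not part of any documented API.

```python
import io
import time
from typing import BinaryIO, Callable, Optional, Tuple


def consume_stream(
    stream: BinaryIO,
    on_chunk: Callable[[bytes], None],
    chunk_size: int = 4096,
) -> Tuple[int, Optional[float]]:
    """Read `stream` chunk by chunk, passing each chunk to `on_chunk`.

    Returns (total_bytes, seconds_until_first_chunk). The second value
    is None when the stream is empty.
    """
    start = time.perf_counter()
    first_chunk_latency: Optional[float] = None
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of stream
        if first_chunk_latency is None:
            first_chunk_latency = time.perf_counter() - start
        total += len(chunk)
        on_chunk(chunk)  # e.g. feed an audio player or append to a file
    return total, first_chunk_latency


# Hypothetical stand-in for a streamed HTTP response body.
fake_audio = io.BytesIO(b"\x00" * 10000)
received = []
total_bytes, latency = consume_stream(fake_audio, received.append)
```

With a real streamed response, `on_chunk` would typically hand the bytes to an audio player so playback can begin before synthesis finishes.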
Supported Speakers (CustomVoice)
The model includes 9 built-in premium timbres. Every speaker can speak any supported language, but quality is best when a speaker is used in its native language.
| Speaker | Voice Description | Native Language |
|---|---|---|
| Vivian | Bright, slightly edgy young female voice | Chinese |
| Serena | Warm, gentle young female voice | Chinese |
| Uncle_Fu | Seasoned male voice with a low, mellow timbre | Chinese |
| Dylan | Youthful Beijing male voice with a clear, natural timbre | Chinese (Beijing Dialect) |
| Eric | Lively Chengdu male voice with a slightly husky brightness | Chinese (Sichuan Dialect) |
| Ryan | Dynamic male voice with strong rhythmic drive | English |
| Aiden | Sunny American male voice with a clear midrange | English |
| Ono_Anna | Playful Japanese female voice with a light, nimble timbre | Japanese |
| Sohee | Warm Korean female voice with rich emotion | Korean |
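Because quality is best in a speaker's native language, an application may want to choose the voice from the target language. The sketch below copies the speaker names and native languages from the table above. The helper names `speakers_for` and `pick_speaker`, and the choice of `Ryan` as a fallback for languages without a native speaker, are illustrative assumptions.

```python
from typing import Dict, List

# Native language per built-in speaker, copied from the table above.
SPEAKER_NATIVE_LANGUAGE: Dict[str, str] = {
    "Vivian": "Chinese",
    "Serena": "Chinese",
    "Uncle_Fu": "Chinese",
    "Dylan": "Chinese (Beijing Dialect)",
    "Eric": "Chinese (Sichuan Dialect)",
    "Ryan": "English",
    "Aiden": "English",
    "Ono_Anna": "Japanese",
    "Sohee": "Korean",
}


def speakers_for(language: str) -> List[str]:
    """Return speakers whose native language starts with `language`.

    Matching is case-insensitive, so "chinese" also matches the two
    dialect speakers.
    """
    wanted = language.strip().lower()
    return [
        name
        for name, native in SPEAKER_NATIVE_LANGUAGE.items()
        if native.lower().startswith(wanted)
    ]


def pick_speaker(language: str, fallback: str = "Ryan") -> str:
    """Pick the first native speaker for `language`, else the fallback.

    The fallback is an arbitrary choice: every speaker can speak any
    supported language, only with possibly lower quality.
    """
    matches = speakers_for(language)
    return matches[0] if matches else fallback
```

For example, `pick_speaker("Korean")` selects `Sohee`, while a language with no native speaker in the table, such as German, falls back to the default voice.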
Best Use Cases
- Real-time interactive applications requiring low-latency streaming TTS.
- Multilingual voice output with fine-grained control over timbre, emotion, and dialect.
- Production deployments that need a single model to handle both streaming and non-streaming generation via a dual-track hybrid architecture.
gigarouter is benchmarking and onboarding Qwen3-TTS-12Hz-1.7B-CustomVoice as a managed, OpenAI-compatible API. Once it is available, no local installation or model loading is required: call the API endpoint.
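Since the API is described as OpenAI-compatible, a request would plausibly follow the shape of the OpenAI speech endpoint (`POST /audio/speech` with `model`, `input`, and `voice`). The sketch below uses only the Python standard library. The base URL is a placeholder, and whether the service accepts the `instructions` and `response_format` fields is an assumption based on the OpenAI request format, not something this page confirms.

```python
import json
import urllib.request
from typing import Optional

# Placeholder: the real base URL is not stated on this page.
BASE_URL = "https://api.example.com/v1"
MODEL_ID = "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"


def build_speech_request(
    text: str,
    voice: str,
    api_key: str,
    instructions: Optional[str] = None,
    response_format: str = "wav",
) -> urllib.request.Request:
    """Build an OpenAI-style speech request without sending it."""
    payload = {
        "model": MODEL_ID,
        "input": text,
        "voice": voice,  # one of the built-in speakers, e.g. "Ryan"
        "response_format": response_format,  # assumed to be supported
    }
    if instructions:
        # Assumed field for style control (emotion, tone, speaking rate).
        payload["instructions"] = instructions
    return urllib.request.Request(
        f"{BASE_URL}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(request: urllib.request.Request, timeout: float = 60.0) -> bytes:
    """Send the request and return the raw audio bytes."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Keeping request construction separate from the network call makes the payload easy to inspect or log before any credit is spent. A call would then look like `synthesize(build_speech_request("Hello", "Ryan", api_key, instructions="Speak slowly and warmly"))`.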