
Qwen3-TTS-12Hz-1.7B-CustomVoice

Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice

A popular open text-to-speech model, with about 2M downloads a month. gigarouter is benchmarking it and will host it as an OpenAI-compatible API.

status: coming soon
API providers: 0
downloads / mo: 2M
license: apache-2.0

about this model

Qwen3-TTS-12Hz-1.7B-CustomVoice is a text-to-speech (TTS) model that provides style control over target timbres via user instructions and supports 9 premium timbres covering combinations of gender, age, language, and dialect across 10 languages.

Overview diagram of Qwen3-TTS model

The model uses a discrete multi-codebook language model architecture for end-to-end speech modeling, bypassing the bottlenecks and cascading errors of traditional LM+DiT pipelines. It is built on Qwen3-TTS-Tokenizer-12Hz, which performs efficient acoustic compression and high-dimensional semantic modeling while preserving paralinguistic and acoustic environmental features.

Key Capabilities

  • Multilingual support: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian, including multiple dialectal voice profiles.
  • Low-latency streaming: emits the first audio packet after a single input character, with end-to-end synthesis latency as low as 97 ms.
  • Instruction-driven voice control: natural language instructions enable adaptive adjustment of timbre, emotion, tone, and speaking rate.
  • Contextual understanding: adapts emotional expression and prosody based on text semantics and instructions.
  • Robustness to noisy input text.
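The first-packet latency claim above can be checked empirically once a streaming endpoint is available. The following is a minimal sketch of a time-to-first-chunk measurement. The helper name `time_to_first_chunk` and the simulated stream are illustrative assumptions, not part of the model or of any gigarouter API, and the sketch does not reproduce the 97 ms figure.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_to_first_chunk(chunks: Iterable[bytes]) -> Tuple[float, Iterator[bytes]]:
    """Return (seconds until the first non-empty chunk, iterator over the audio chunks)."""
    start = time.perf_counter()
    it = iter(chunks)
    first = b""
    for first in it:
        if first:  # skip empty keep-alive chunks
            break
    elapsed = time.perf_counter() - start

    def replay() -> Iterator[bytes]:
        # Hand back the chunk we already consumed, then the rest of the stream.
        if first:
            yield first
        yield from it

    return elapsed, replay()


def _simulated_stream(delay_s: float = 0.05) -> Iterator[bytes]:
    """Stand-in for a streaming TTS response (hypothetical; no network involved)."""
    time.sleep(delay_s)
    yield b"\x00\x01"
    yield b"\x02\x03"


elapsed, audio = time_to_first_chunk(_simulated_stream())
data = b"".join(audio)
print(f"first chunk after {elapsed * 1000:.0f} ms, {len(data)} bytes total")
```

In practice the iterable would be the chunked body of a streaming HTTP response, and the measurement would also include network round-trip time.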

Supported Speakers (CustomVoice)

The model includes 9 built-in premium timbres. Every speaker can speak any supported language, but quality is best when a speaker is used in its native language.

| Speaker | Voice Description | Native Language |
| --- | --- | --- |
| Vivian | Bright, slightly edgy young female voice | Chinese |
| Serena | Warm, gentle young female voice | Chinese |
| Uncle_Fu | Seasoned male voice with a low, mellow timbre | Chinese |
| Dylan | Youthful Beijing male voice with a clear, natural timbre | Chinese (Beijing Dialect) |
| Eric | Lively Chengdu male voice with a slightly husky brightness | Chinese (Sichuan Dialect) |
| Ryan | Dynamic male voice with strong rhythmic drive | English |
| Aiden | Sunny American male voice with a clear midrange | English |
| Ono_Anna | Playful Japanese female voice with a light, nimble timbre | Japanese |
| Sohee | Warm Korean female voice with rich emotion | Korean |

Architecture diagram of Qwen3-TTS
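
Since quality is best in a speaker's native language, a client may want to pick a speaker from the target language. The sketch below encodes the speaker table as a lookup. The speaker names and native languages come from the table above, while the helper `native_speakers` and its prefix-matching rule are illustrative assumptions.

```python
from typing import Dict, List

# Native language per built-in speaker, copied from the speaker table.
NATIVE_LANGUAGE: Dict[str, str] = {
    "Vivian": "Chinese",
    "Serena": "Chinese",
    "Uncle_Fu": "Chinese",
    "Dylan": "Chinese (Beijing Dialect)",
    "Eric": "Chinese (Sichuan Dialect)",
    "Ryan": "English",
    "Aiden": "English",
    "Ono_Anna": "Japanese",
    "Sohee": "Korean",
}


def native_speakers(language: str) -> List[str]:
    """Return speakers whose native language starts with `language` (case-insensitive).

    Prefix matching means "Chinese" also returns the dialect speakers.
    """
    wanted = language.strip().lower()
    return [
        name
        for name, native in NATIVE_LANGUAGE.items()
        if native.lower().startswith(wanted)
    ]


print(native_speakers("English"))
print(native_speakers("German"))
```

An empty result, as for German here, means no built-in speaker is native to that language; any of the 9 speakers can still be used, per the note above the table.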

Best Use Cases

  • Real-time interactive applications requiring low-latency streaming TTS.
  • Multilingual voice output with fine-grained control over timbre, emotion, and dialect.
  • Production deployments that need a single model to handle both streaming and non-streaming generation via a dual-track hybrid architecture.

Once live, this model will be available on gigarouter as a managed, OpenAI-compatible API. No local installation or model loading will be required: you simply call the API endpoint.
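
Because the endpoint is described as OpenAI-compatible, a request would presumably follow the shape of OpenAI's `/audio/speech` endpoint (`model`, `input`, `voice`, and optionally `instructions`). The sketch below only builds such a request and does not send it. The base URL is a placeholder, and the model identifier, the accepted voice names, and support for `instructions` on gigarouter are all assumptions, since the page does not document them.

```python
import json
import urllib.request
from typing import Optional

# Hypothetical placeholder: the page does not give gigarouter's API base URL.
BASE_URL = "https://api.example.invalid/v1"
# Assumption: the hosted model is addressed by its repository id.
MODEL_ID = "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"


def build_speech_request(
    text: str,
    voice: str,
    api_key: str,
    instructions: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style text-to-speech request."""
    payload = {"model": MODEL_ID, "input": text, "voice": voice}
    if instructions:
        # Natural-language style control; whether the host accepts this
        # field is an assumption.
        payload["instructions"] = instructions
    return urllib.request.Request(
        url=f"{BASE_URL}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_speech_request(
    "Hello from Qwen3-TTS.",
    voice="Ryan",
    api_key="YOUR_API_KEY",
    instructions="Speak calmly and slowly.",
)
print(req.get_method(), req.full_url)

# Sending is left out because the model is not yet live. With a real base URL:
# with urllib.request.urlopen(req) as resp:
#     audio_bytes = resp.read()
```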

not yet live

We're benchmarking and onboarding Qwen3-TTS-12Hz-1.7B-CustomVoice as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.