
MOSS-TTS

OpenMOSS-Team/MOSS-TTS

A popular open text-to-speech model with 911.8K downloads a month. gigarouter is benchmarking it and plans to host it as an OpenAI-compatible API.

status: coming soon
API providers: 0
downloads / mo: 911.8K
license: apache-2.0

about this model

The MOSS‑TTS Family, developed by MOSI.AI and the OpenMOSS team, is an open‑source suite of speech and sound generation models designed for high‑fidelity, expressive synthesis in complex real‑world scenarios. The family comprises five specialized models that can be used independently or composed into a complete pipeline.

Model Components

  • MOSS‑TTS – Flagship production model for zero‑shot voice cloning, ultra‑long speech (up to one hour), token‑level duration control, and fine‑grained pronunciation control via Pinyin, IPA, or mixed input.
  • MOSS‑TTSD – Expressive, multi‑speaker spoken dialogue generation. Version 1.0 reports industry‑leading objective metrics and outperformed leading closed‑source models (Doubao, Gemini 2.5 Pro) in subjective evaluations.
  • MOSS‑VoiceGenerator – Generates diverse voices and styles from text prompts without any reference audio. Surpasses other top voice design models in arena ratings.
  • MOSS‑SoundEffect – Creates controllable‑duration sound effects for natural environments, urban scenes, biological sounds, human actions, and music.
  • MOSS‑TTS‑Realtime – Multi‑turn, context‑aware model for low‑latency voice agents. Time to first byte (TTFB) is 180 ms; combined LLM first‑sentence plus TTFB latency is 377 ms.

Key Strengths

  • Production‑ready voice cloning with stable speaker identity at scale
  • Long‑form stability for extended narration (up to 60 minutes in a single run)
  • Multilingual synthesis across 20 languages, including code‑switching
  • Controllable pacing, rhythm, and pronunciation at phoneme and token level

Benchmark and Evaluation Results

  • MOSS‑TTSD v1.0 outperformed Doubao and Gemini 2.5 Pro in subjective evaluations.
  • MOSS‑VoiceGenerator surpassed other top‑tier voice design models in arena ratings.
  • MOSS‑TTS‑Realtime: TTFB 180 ms; combined LLM first‑sentence + TTFB latency 377 ms.
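
The two MOSS‑TTS‑Realtime figures can be read as a rough latency budget. The sketch below assumes the 377 ms figure is a simple sum of the LLM's first‑sentence time and the same 180 ms TTFB, with no overlap between the stages; the listing does not confirm how the numbers were measured, so the derived value is only an estimate. The function name is hypothetical.

```python
# Figures quoted in the listing, in milliseconds.
TTS_TTFB_MS = 180
COMBINED_FIRST_SENTENCE_MS = 377


def implied_llm_first_sentence_ms(combined_ms: int, tts_ttfb_ms: int) -> int:
    """Estimate the LLM share of the combined latency.

    Assumption: combined = LLM first-sentence time + TTS TTFB (purely additive).
    """
    if tts_ttfb_ms > combined_ms:
        raise ValueError("TTFB cannot exceed the combined latency")
    return combined_ms - tts_ttfb_ms


llm_ms = implied_llm_first_sentence_ms(COMBINED_FIRST_SENTENCE_MS, TTS_TTFB_MS)
tts_share = TTS_TTFB_MS / COMBINED_FIRST_SENTENCE_MS

print(f"implied LLM first-sentence time: {llm_ms} ms")   # 197 ms
print(f"TTS share of combined latency: {tts_share:.0%}")  # 48%
```

Under that assumption, the TTS stage accounts for a little under half of the time before a voice agent starts speaking.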

The family supports 20 languages, including Mandarin, English, German, Spanish, French, Japanese, Korean, and Arabic. Once live on gigarouter, the models are to be served as OpenAI‑compatible APIs, requiring no local setup.
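
Since the model is not yet live, the following is only a sketch of what a call to an OpenAI‑compatible speech endpoint might look like. The request shape (a POST to `/audio/speech` with `model`, `input`, `voice`, and `response_format`) follows the OpenAI speech API. The base URL, model ID, and voice name are assumptions for illustration and are not given in this listing.

```python
import json
import urllib.request

# Assumptions, not confirmed by the listing:
BASE_URL = "https://api.gigarouter.example/v1"  # hypothetical base URL
MODEL_ID = "OpenMOSS-Team/MOSS-TTS"             # assumed to match the repository name
VOICE = "default"                               # hypothetical voice name


def build_speech_request(text, api_key, base_url=BASE_URL,
                         model=MODEL_ID, voice=VOICE, response_format="mp3"):
    """Build (but do not send) an OpenAI-style text-to-speech request."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        url=f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(text, api_key, out_path="speech.mp3"):
    """Send the request and write the returned audio bytes to out_path."""
    req = build_speech_request(text, api_key)
    with urllib.request.urlopen(req, timeout=60) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path


if __name__ == "__main__":
    # Only inspects the request; no network call is made here.
    req = build_speech_request("Hello from MOSS-TTS.", api_key="YOUR_API_KEY")
    print(req.get_method(), req.full_url)
```

Building the request separately from sending it makes the payload easy to inspect before the endpoint exists. The real values should replace the placeholders once the provider publishes them.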

[Images: MOSS-TTS Family overview; MOSS-TTS introduction visualization]
not yet live

We're benchmarking and onboarding MOSS-TTS as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.