
mms-tts-eng

facebook/mms-tts-eng

A popular open text-to-speech model, with 137K downloads a month. gigarouter is benchmarking it and plans to host it as an OpenAI-compatible API; it is not yet live.

status: coming soon
API providers: 0
downloads / mo: 137K
license: cc-by-nc-4.0

about this model

This is the English (eng) language text-to-speech checkpoint from Meta AI's Massively Multilingual Speech (MMS) project. It uses a VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) architecture, which is an end-to-end conditional variational autoencoder that predicts speech waveforms directly from text. The model combines a flow-based acoustic module with a HiFi-GAN-style decoder and includes a stochastic duration predictor, enabling varied speech rhythms from the same input text.

Key Strengths

  • End-to-end training with variational lower bound and adversarial losses for high-quality synthesis.
  • Stochastic duration predictor makes the output non-deterministic, so the same text can be rendered with different rhythms; a fixed random seed is needed to reproduce a given waveform.
  • Trained as part of the MMS initiative, which scales speech technology to over 1,000 languages; this checkpoint is specialized for English.
  • Licensed under CC-BY-NC 4.0, which permits non-commercial use only.

Recommended Use

This model is best suited to English-only text-to-speech applications that need natural, varied speech output. Once gigarouter hosts the model as a managed API, developers will be able to integrate it through OpenAI-compatible endpoints without managing dependencies or hardware. Note that the CC-BY-NC 4.0 license restricts the model to non-commercial use.
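As a rough illustration of what an OpenAI-compatible integration could look like, the sketch below builds (but does not send) a request shaped like OpenAI's `POST /v1/audio/speech` call. The endpoint is not live, so everything specific here is an assumption: the base URL is a placeholder, and the model ID, the `voice` value, and the `response_format` are guesses at what the hosted API might accept.

```python
import json
import urllib.request

# Hypothetical placeholder: the real base URL is not published yet.
BASE_URL = "https://api.example.invalid/v1"


def build_speech_request(text, api_key, base_url=BASE_URL,
                         model="facebook/mms-tts-eng"):
    """Build an OpenAI-style text-to-speech request without sending it.

    The model ID, voice name, and response format are assumptions for
    illustration; check the provider's documentation once the API is live.
    """
    payload = {
        "model": model,            # assumed model identifier
        "input": text,             # text to synthesize
        "voice": "default",        # assumed; OpenAI's schema requires a voice
        "response_format": "wav",  # assumed output format
    }
    return urllib.request.Request(
        url=f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_speech_request("Hello from MMS.", api_key="sk-test")
print(req.get_method(), req.full_url)
```

Sending the request would be a matter of passing `req` to `urllib.request.urlopen` and writing the response bytes to a file, once a real base URL and API key are available.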

Benchmark Results

The model card does not report specific benchmark numbers. For further technical details, refer to the MMS paper (Pratap et al., 2023).

Model Architecture

VITS consists of a posterior encoder, a decoder, and a conditional prior. A flow-based module, made up of a Transformer-based text encoder and multiple coupling layers, predicts spectrogram-based acoustic features conditioned on the input text. A stack of transposed convolutional layers (similar to the HiFi-GAN vocoder) then decodes those features into a waveform. During inference, the text encodings are up-sampled according to the durations from the duration predictor, then passed through the flow module and decoder to generate the waveform.
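The up-sampling step can be pictured with a toy sketch: each text encoding is repeated for as many frames as its predicted duration. This is not the VITS implementation; the encodings here are plain characters rather than vectors, and `sample_durations` is a hypothetical stand-in that draws random integers where the real model uses a learned stochastic duration predictor. It does show why a fixed seed makes the output repeatable.

```python
import random


def upsample_by_duration(encodings, durations):
    """Repeat each token encoding durations[i] times (length regulation)."""
    if len(encodings) != len(durations):
        raise ValueError("one duration per token encoding is required")
    frames = []
    for encoding, duration in zip(encodings, durations):
        frames.extend([encoding] * duration)
    return frames


def sample_durations(num_tokens, rng, low=1, high=4):
    """Toy stand-in for a stochastic duration predictor (hypothetical)."""
    return [rng.randint(low, high) for _ in range(num_tokens)]


tokens = list("hello")

# Fixed durations: 'h' lasts 2 frames, 'e' 1 frame, and so on.
fixed = upsample_by_duration(tokens, [2, 1, 3, 1, 2])
print("".join(fixed))

# Sampled durations: the same seed reproduces the same rhythm.
run_a = upsample_by_duration(tokens, sample_durations(len(tokens), random.Random(0)))
run_b = upsample_by_duration(tokens, sample_durations(len(tokens), random.Random(0)))
print(run_a == run_b)
```

With an unseeded generator, two runs over the same text would generally produce frame sequences of different lengths, which is the toy analogue of the varied rhythms described above.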

not yet live

We're benchmarking and onboarding mms-tts-eng as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.