gigarouter

Kimi-Audio-7B-Instruct

moonshotai/Kimi-Audio-7B-Instruct

A popular open text-to-speech model, with 79K downloads a month. gigarouter is benchmarking it and preparing to host it as an OpenAI-compatible API.

status: coming soon
API providers: 0
downloads / mo: 79K
license: MIT

about this model

Kimi-Audio-7B-Instruct is an open-source audio foundation model for text-to-speech (TTS) and a broad range of audio processing tasks, including speech recognition, audio question answering, captioning, emotion recognition, sound event classification, and end-to-end speech conversation. When it goes live on gigarouter as a managed API, it will offer hosted inference without the overhead of deploying the model yourself.

[Figure: Kimi-Audio architecture diagram]

Key capabilities

  • Unified framework for diverse audio tasks within a single model
  • State-of-the-art performance on numerous audio benchmarks (see technical report)
  • Pre-trained on 13 million hours of speech, music, and environmental audio plus text data
  • Hybrid audio input combining continuous acoustic features with discrete semantic tokens
  • LLM core with parallel output heads for text and audio token generation
  • Chunk-wise streaming detokenizer based on flow matching for low-latency audio synthesis
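
To make the chunk-wise streaming idea concrete, here is a minimal Python sketch of the control flow only: audio tokens are buffered into fixed-size chunks and each chunk is decoded as soon as it is full, so playback can begin before generation finishes. The names `stream_chunks` and `fake_decode` are hypothetical, and `fake_decode` is a stand-in, not the model's flow-matching detokenizer.

```python
from typing import Callable, Iterable, Iterator, List


def stream_chunks(
    tokens: Iterable[int],
    chunk_size: int,
    decode: Callable[[List[int]], List[float]],
) -> Iterator[List[float]]:
    """Yield decoded audio for each full chunk of tokens, then any remainder."""
    buf: List[int] = []
    for tok in tokens:
        buf.append(tok)
        if len(buf) == chunk_size:
            yield decode(buf)  # emit audio without waiting for later tokens
            buf = []
    if buf:
        yield decode(buf)  # flush the final partial chunk


def fake_decode(chunk: List[int]) -> List[float]:
    """Stand-in decoder (assumption): two dummy samples per token."""
    return [tok / 100.0 for tok in chunk for _ in range(2)]


if __name__ == "__main__":
    for i, samples in enumerate(stream_chunks(range(5), 2, fake_decode)):
        print(f"chunk {i}: {len(samples)} samples")
```

The latency benefit comes from the generator: the first chunk of audio is available after `chunk_size` tokens rather than after the full sequence.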

Ideal use cases

  • Real-time TTS for conversational agents, voice assistants, and content narration
  • High-quality speech-to-speech interaction with both text and audio output
  • Multi-purpose audio understanding and generation in a single API call

Performance highlights

Kimi-Audio achieves state-of-the-art results across multiple audio benchmarks. For detailed comparisons, consult the technical report.

not yet live

We're benchmarking and onboarding Kimi-Audio-7B-Instruct as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
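
As a rough idea of what a call might look like once the model is live, the sketch below builds (but does not send) a request in the shape of the OpenAI speech route, `POST /audio/speech`. The base URL, the `voice` value, and the helper name `build_speech_request` are assumptions for illustration; this page does not publish an endpoint or supported parameters yet.

```python
import json
import urllib.request

# Hypothetical base URL (assumption): no endpoint is published on this page yet.
BASE_URL = "https://api.gigarouter.example/v1"
MODEL_ID = "moonshotai/Kimi-Audio-7B-Instruct"


def build_speech_request(text, api_key, voice="default", base_url=BASE_URL):
    """Build, without sending, a POST to an OpenAI-style /audio/speech route."""
    payload = {
        "model": MODEL_ID,
        "input": text,
        "voice": voice,  # assumed value; supported voices are not documented here
        "response_format": "wav",
    }
    return urllib.request.Request(
        url=f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_speech_request("Hello from Kimi-Audio.", api_key="YOUR_API_KEY")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Once a real endpoint exists, passing the request to `urllib.request.urlopen` and writing the response bytes to a file would be the remaining step.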