Kimi-Audio-7B-Instruct
moonshotai/Kimi-Audio-7B-Instruct
A popular open text-to-speech model with 79K downloads a month. Gigarouter benchmarks and hosts it as an OpenAI-compatible API.
About this model
Kimi-Audio-7B-Instruct is an open-source audio foundation model that covers text-to-speech (TTS) alongside a broad range of other audio tasks: speech recognition, audio question answering, captioning, emotion recognition, sound event classification, and end-to-end speech conversation. Hosted by Gigarouter as a managed API, it can be called for inference without deploying or operating the model yourself.
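As a rough illustration of what a TTS call to an OpenAI-compatible API could look like, the sketch below builds (but does not send) a request in the style of the OpenAI `/audio/speech` route. The base URL, the `voice` value, and the assumption that the model is exposed on that route are all placeholders, not documented Gigarouter behavior; check the provider's API reference for the real values.

```python
import json
import urllib.request

# Placeholder values: the base URL and the voice name are assumptions for illustration.
BASE_URL = "https://api.gigarouter.example/v1"
MODEL_ID = "moonshotai/Kimi-Audio-7B-Instruct"


def build_speech_request(text, api_key, voice="default", base_url=BASE_URL):
    """Build a POST request for an OpenAI-style /audio/speech route without sending it."""
    payload = {
        "model": MODEL_ID,
        "input": text,
        "voice": voice,  # hypothetical voice identifier
        "response_format": "wav",
    }
    return urllib.request.Request(
        url=f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_speech_request("Hello from Kimi-Audio.", api_key="YOUR_API_KEY")
    print(req.get_method(), req.full_url)
    # Sending is left commented out because the endpoint above is a placeholder:
    # with urllib.request.urlopen(req) as resp, open("speech.wav", "wb") as out:
    #     out.write(resp.read())
```

Keeping request construction separate from sending makes it easy to inspect the payload before pointing it at a real endpoint.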
Key capabilities
- Unified framework for diverse audio tasks within a single model
- State-of-the-art performance on numerous audio benchmarks (see technical report)
- Pre-trained on over 13 million hours of audio (speech, music, and environmental sounds) plus text data
- Hybrid audio input combining continuous acoustic features with discrete semantic tokens
- LLM core with parallel output heads for text and audio token generation
- Chunk-wise streaming detokenizer based on flow matching for low-latency audio synthesis
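To make the chunk-wise streaming idea in the last bullet concrete, here is a toy sketch of why decoding in chunks lowers latency: audio for the first chunk can be emitted as soon as that chunk's tokens exist, rather than after the whole sequence. It is not the model's detokenizer. `fake_detokenize` is a hypothetical stand-in for the flow-matching decoder, and the chunk size is an arbitrary value chosen for the example.

```python
from typing import Iterable, Iterator, List


def fake_detokenize(chunk: List[int]) -> List[float]:
    """Hypothetical stand-in for the real decoder: one dummy sample per token."""
    return [token / 100.0 for token in chunk]


def stream_audio(tokens: Iterable[int], chunk_size: int = 4) -> Iterator[List[float]]:
    """Yield decoded audio for each full chunk as it fills, then flush any remainder."""
    buffer: List[int] = []
    for token in tokens:
        buffer.append(token)
        if len(buffer) == chunk_size:
            yield fake_detokenize(buffer)
            buffer = []
    if buffer:
        yield fake_detokenize(buffer)


if __name__ == "__main__":
    for i, audio in enumerate(stream_audio(range(10), chunk_size=4)):
        print(f"chunk {i}: {len(audio)} samples")
```

Because `stream_audio` is a generator, a caller can start playback after the first chunk while later tokens are still being produced.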
Ideal use cases
- Real-time TTS for conversational agents, voice assistants, and content narration
- High-quality speech-to-speech interaction with both text and audio output
- Multi-purpose audio understanding and generation in a single API call
Performance highlights
Kimi-Audio achieves state-of-the-art results across multiple audio benchmarks. For detailed comparisons, consult the technical report.
Additional resources
We're benchmarking and onboarding Kimi-Audio-7B-Instruct as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.