Kimi-Audio-7B-Instruct

moonshotai/Kimi-Audio-7B-Instruct

A popular open text-to-speech model, with 79K downloads a month. Gigarouter is benchmarking it and will host it as an OpenAI-compatible API.

status

coming soon

downloads / mo

79K

license

MIT

about this model

Kimi-Audio-7B-Instruct is an open-source audio foundation model that handles text-to-speech (TTS) alongside a broad range of audio tasks: speech recognition, audio question answering, captioning, emotion recognition, sound event classification, and end-to-end speech conversation. Once it is live on Gigarouter as a managed API, you will be able to run inference without deploying the model yourself.

Key capabilities

Unified framework for diverse audio tasks within a single model
State-of-the-art performance on numerous audio benchmarks (see technical report)
Pre-trained on over 13 million hours of audio (speech, music, and environmental sounds) plus text data
Hybrid audio input combining continuous acoustic features with discrete semantic tokens
LLM core with parallel output heads for text and audio token generation
Chunk-wise streaming detokenizer based on flow matching for low-latency audio synthesis

Ideal use cases

Real-time TTS for conversational agents, voice assistants, and content narration
High-quality speech-to-speech interaction with both text and audio output
Multi-purpose audio understanding and generation in a single API call

Performance highlights

Kimi-Audio achieves state-of-the-art results across multiple audio benchmarks. For detailed comparisons, consult the technical report.

not yet live

We're benchmarking and onboarding Kimi-Audio-7B-Instruct as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
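
Because the endpoint is not live yet, the following is only a sketch of what a request might look like, assuming the hosted API follows the standard OpenAI speech route (`/audio/speech` with `model`, `input`, `voice`, and `response_format` fields). The base URL, the `GIGAROUTER_API_KEY` environment variable, and the `default` voice name are placeholders invented for illustration; none of them are documented values.

```python
import json
import os
import urllib.request

# Hypothetical placeholder: the real base URL has not been published.
BASE_URL = "https://api.gigarouter.example/v1"
MODEL_ID = "moonshotai/Kimi-Audio-7B-Instruct"


def build_speech_request(text, voice="default", audio_format="wav", api_key=None):
    """Build (but do not send) a POST request in the OpenAI speech-API shape."""
    # GIGAROUTER_API_KEY is an assumed variable name, not a documented one.
    api_key = api_key or os.environ.get("GIGAROUTER_API_KEY", "")
    payload = {
        "model": MODEL_ID,
        "input": text,
        "voice": voice,  # assumed voice name
        "response_format": audio_format,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(text, path, **kwargs):
    """Send the request and write the returned audio bytes to `path`."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        with open(path, "wb") as out:
            out.write(response.read())


if __name__ == "__main__":
    # Only inspects the request; nothing is sent over the network.
    req = build_speech_request("Hello from Kimi-Audio.", api_key="sk-placeholder")
    print(req.get_method(), req.full_url)
    print(json.loads(req.data))
```

Splitting request construction from sending makes it possible to inspect the payload today and swap in the real base URL once the model lands; `synthesize_to_file` would only work against a live endpoint.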