Kyutai STT 2.6B English
kyutai/stt-2.6b-en
published Jun 2025 · updated Jun 2025
Kyutai STT 2.6B English is a streaming speech-to-text model that transcribes English audio with a 2.5-second delay.
specs
| Spec | Value |
| --- | --- |
| Task | Automatic Speech Recognition (ASR) / Streaming Speech-to-Text |
| Architecture | Decoder-only Transformer with Mimi audio tokenizer |
| Parameters | ~2.6 billion |
| License | CC-BY 4.0 |
about this model
kyutai/stt-2.6b-en is a streaming automatic speech recognition (ASR) model that transcribes English audio into text with punctuation and capitalization. It emits text about 2.5 seconds behind the incoming audio instead of waiting for the full recording.
Key Capabilities
- Streaming inference: processes audio in chunks for real-time transcription, suitable for interactive applications.
- Returns word-level timestamps for each transcribed word.
- Robust to noisy conditions; performs reliably on audio segments up to 2 hours without additional adaptation.
- Based on a decoder-only Transformer architecture that consumes audio tokenized by Mimi (12.5 Hz frame rate, 32 audio tokens per frame) and outputs text tokens. The text stream is shifted by a 2.5-second delay relative to the audio stream.
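The frame arithmetic implied by these figures can be sketched in a few lines. This is a back-of-the-envelope calculation, not part of any Kyutai API: the constant and function names are made up for illustration, and the 24 kHz default sample rate is an assumption about the audio fed to Mimi, not something stated on this page.

```python
FRAME_RATE_HZ = 12.5    # Mimi frames per second (from the description above)
TOKENS_PER_FRAME = 32   # audio tokens per frame (from the description above)
TEXT_DELAY_S = 2.5      # delay of the text stream relative to the audio stream


def frame_duration_ms() -> float:
    """Duration of one Mimi frame in milliseconds."""
    return 1000.0 / FRAME_RATE_HZ


def audio_tokens(duration_s: float) -> int:
    """Audio tokens produced for a clip, counting whole frames only."""
    frames = int(duration_s * FRAME_RATE_HZ)
    return frames * TOKENS_PER_FRAME


def samples_per_frame(sample_rate_hz: int = 24_000) -> int:
    """Raw samples covered by one frame (sample rate is an assumption)."""
    return int(sample_rate_hz / FRAME_RATE_HZ)


if __name__ == "__main__":
    print(frame_duration_ms())             # 80.0
    print(audio_tokens(2 * 3600))          # 2880000 tokens for a 2-hour clip
    print(samples_per_frame())             # 1920
    print(TEXT_DELAY_S * FRAME_RATE_HZ)    # 31.25 frames of delay
```

Each frame therefore covers 80 ms of audio, which is the natural chunk size for a streaming client under these assumptions.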
Performance
- On a single H100 GPU, the model can batch-process 400 audio streams in real time.
- A single L40S GPU serves 64 simultaneous streaming connections via a Rust websocket server at a 3x real-time factor.
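For rough capacity planning, the per-GPU stream counts above can be turned into a GPU estimate. The sketch below assumes capacity scales linearly with the number of GPUs and ignores networking and scheduling overhead; `gpus_needed` and `audio_hours_per_day` are hypothetical helper names, not part of any published tooling.

```python
import math

H100_REALTIME_STREAMS = 400   # streams per H100, from the figure above
L40S_CONNECTIONS = 64         # connections per L40S, from the figure above


def gpus_needed(concurrent_streams: int,
                streams_per_gpu: int = H100_REALTIME_STREAMS) -> int:
    """Smallest GPU count that covers the requested concurrent streams."""
    if concurrent_streams < 0 or streams_per_gpu <= 0:
        raise ValueError("stream counts must be non-negative and capacity positive")
    return math.ceil(concurrent_streams / streams_per_gpu)


def audio_hours_per_day(streams_per_gpu: int = H100_REALTIME_STREAMS) -> float:
    """Hours of audio one GPU transcribes per day at exactly real time."""
    return streams_per_gpu * 24.0


if __name__ == "__main__":
    print(gpus_needed(1000))                    # 3 H100s
    print(gpus_needed(100, L40S_CONNECTIONS))   # 2 L40S cards
    print(audio_hours_per_day())                # 9600.0
```

Real deployments would likely need headroom beyond this estimate, since the quoted figures describe peak batched throughput.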
Training Details
The model was pretrained on 2.5 million hours of publicly available audio, with synthetic transcripts generated by Whisper-timestamped. It was then fine-tuned on 24,000 hours of public datasets with ground-truth transcripts. A final long-form fine-tuning stage used concatenated LibriSpeech examples and synthesized dialogs (23,000 hours in total).
Additional Features
Transcripts include capitalization and punctuation. Word-level timestamps can be derived by subtracting the 2.5-second text stream offset from the audio frame offset at which each word is emitted. The model is English-only (language identifier en) and released under CC-BY 4.0. The safetensors metadata reports approximately 2.7 billion parameters, slightly more than the nominal 2.6 billion in the model name.
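The timestamp derivation described above can be sketched as follows. It assumes each word's text token is emitted at a known step index in the 12.5 Hz stream. The function name and the clamping of negative values to zero are illustrative choices, not documented model behavior.

```python
FRAME_RATE_HZ = 12.5   # steps per second in the audio/text streams
TEXT_DELAY_S = 2.5     # text stream offset relative to the audio


def word_time_s(text_step: int) -> float:
    """Map the step at which a word was emitted back to its time in the audio."""
    emitted_at_s = text_step / FRAME_RATE_HZ
    # Clamp so that very early steps never yield a negative timestamp.
    return max(0.0, emitted_at_s - TEXT_DELAY_S)


if __name__ == "__main__":
    print(word_time_s(50))    # 1.5  (emitted at 4.0 s, minus the 2.5 s delay)
    print(word_time_s(100))   # 5.5
```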
best for
- Real-time transcription of live audio streams or voice calls
- Building voice agents that require word-level timestamps and low latency
FAQ
The model introduces a 2.5-second delay between audio input and text output.
The model weights are licensed under CC-BY 4.0.
The model returns word-level timestamps along with the transcript.
Use the OpenAI-compatible endpoint with your gigarouter API key and specify the model ID kyutai/stt-2.6b-en.
The model accepts audio tokenized by the Mimi codec; for API usage, raw audio is processed into the required format by gigarouter.
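As a rough illustration of what an OpenAI-compatible transcription call could look like, the sketch below builds (but does not send) a multipart POST to the `/audio/transcriptions` path used by OpenAI-style APIs. The base URL is a placeholder, since this page does not give the real gigarouter endpoint, and `build_transcription_request` is a hypothetical helper name.

```python
import urllib.request
import uuid

BASE_URL = "https://api.example.invalid/v1"  # placeholder, NOT the real endpoint
MODEL_ID = "kyutai/stt-2.6b-en"


def build_transcription_request(audio_bytes: bytes, filename: str, api_key: str,
                                base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a multipart/form-data request carrying the model ID and audio file."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        # Form field: model
        (f'--{boundary}\r\n'
         f'Content-Disposition: form-data; name="model"\r\n\r\n'
         f'{MODEL_ID}\r\n').encode(),
        # Form field: file (raw audio; the service handles any further encoding)
        (f'--{boundary}\r\n'
         f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         f'Content-Type: application/octet-stream\r\n\r\n').encode(),
        audio_bytes,
        f'\r\n--{boundary}--\r\n'.encode(),
    ])
    return urllib.request.Request(
        url=f"{base_url}/audio/transcriptions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Sending the request would be a matter of passing it to `urllib.request.urlopen` once the real base URL is known. The response format is not documented on this page, so it is not parsed here.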
We're benchmarking and onboarding Kyutai STT 2.6B English as a hosted, OpenAI-compatible API.