Distil Large V3.5
distil-whisper/distil-large-v3.5
published Dec 2024 · updated Apr 2026
Distil Large V3.5 is a distilled automatic speech recognition model that transcribes English audio with high efficiency, offering ~1.5x faster inference than Whisper Large V3 Turbo while maintaining competitive word error rates.
specs
| Spec | Value |
|---|---|
| Task | Automatic Speech Recognition (ASR) |
| Architecture | Distil-Whisper (knowledge-distilled Transformer) |
| Parameters | 756M |
| Training Data | 98k hours of diverse public audio |
about this model
Distil-Whisper Distil-Large-v3.5 is an automatic speech recognition (ASR) model that distills OpenAI’s Whisper-Large-v3 into a smaller, faster variant while preserving accuracy. It is trained on over 98,000 hours of diverse public data using a patient teacher, extended training schedule, and SpecAugment data augmentation, resulting in improved robustness over earlier Distil-Whisper models.
Key strengths
- Speed: Approximately 1.5× faster than Whisper-Large-v3-Turbo on long-form transcription, measured by inverse real-time factor (RTFx, higher is faster).
- Accuracy: Out-of-distribution (OOD) word error rate (WER) of 7.08% on short-form and 11.39% on long-form tasks.
- Speculative decoding: Can serve as a draft model for Whisper-Large-v3, achieving ~2× faster inference while producing identical outputs.
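Why speculative decoding leaves the transcript unchanged can be illustrated with a toy greedy version. This is a minimal sketch, not the implementation used by any particular library: `target_next` and `draft_next` are hypothetical stand-ins for the Whisper-Large-v3 and Distil-Large-v3.5 decoders, `EOS` is a made-up token id, and a real system verifies all drafted tokens in one batched forward pass of the target model instead of one call per token.

```python
from typing import Callable, List, Sequence

NextFn = Callable[[Sequence[int]], int]  # maps a token prefix to the greedy next token
EOS = 0  # hypothetical end-of-sequence token id


def greedy_decode(next_token: NextFn, prompt: Sequence[int], max_new: int) -> List[int]:
    """Plain greedy decoding with a single model (the reference behaviour)."""
    out = list(prompt)
    for _ in range(max_new):
        tok = next_token(out)
        out.append(tok)
        if tok == EOS:
            break
    return out


def speculative_decode(target_next: NextFn, draft_next: NextFn,
                       prompt: Sequence[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    out = list(prompt)
    limit = len(out) + max_new
    done = False
    while not done and len(out) < limit:
        # 1. The cheap draft model proposes up to k tokens.
        proposal: List[int] = []
        for _ in range(min(k, limit - len(out))):
            proposal.append(draft_next(out + proposal))
        # 2. The target checks each position and always has the final say.
        for drafted in proposal:
            verified = target_next(out)
            out.append(verified)
            if verified == EOS:
                done = True
                break
            if verified != drafted:
                break  # first disagreement: discard the rest of the proposal
    return out


# Toy "models": the draft agrees with the target except at every 4th position.
def target_next(seq: Sequence[int]) -> int:
    return (3 * sum(seq) + len(seq)) % 11


def draft_next(seq: Sequence[int]) -> int:
    tok = target_next(seq)
    return tok if len(seq) % 4 else (tok + 1) % 11


if __name__ == "__main__":
    same = speculative_decode(target_next, draft_next, [1], 20) == greedy_decode(target_next, [1], 20)
    print("identical to target-only decoding:", same)
```

Because every emitted token is the target model's own choice, the draft only affects how many target steps can be confirmed per round. The speed-up therefore depends on how often the draft agrees with the target.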
Performance comparisons
Short-form OOD WER (lower is better):
| Model | Params (M) | OOD WER (%) |
|---|---|---|
| large-v3-turbo | 809 | 7.30 |
| distil-large-v3 | 756 | 7.53 |
| distil-large-v3.5 | 756 | 7.08 |
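For readers unfamiliar with the metric, WER is the word-level edit distance between the reference and the hypothesis (substitutions + deletions + insertions) divided by the number of reference words. The following is a minimal sketch of that definition. It only lowercases and splits on whitespace, which is an assumption made for brevity: benchmark pipelines typically apply a fuller text normalisation before scoring, so this will not reproduce the table values exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate as a fraction: word-level edit distance / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only one row of the DP table.
    prev = list(range(len(hyp) + 1))  # distance from an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance to an empty hypothesis prefix: i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
    print(f"{100 * wer:.2f}")  # 16.67
```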
Long-form OOD WER and average RTFx (higher RTFx is faster):
| Model | OOD WER (%) | Avg RTFx |
|---|---|---|
| large-v3-turbo | 10.25 | 33.81 |
| distil-large-v3 | 11.6 | 48.64 |
| distil-large-v3.5 | 11.39 | 49.34 |
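The "~1.5× faster" figure can be checked directly from the RTFx column. RTFx is the inverse real-time factor, that is, seconds of audio transcribed per second of compute. The short sketch below computes the ratio using only the table values. The helper names are illustrative, and the one-hour example is a back-of-the-envelope estimate, not a measured result.

```python
# Average long-form RTFx values copied from the table above.
AVG_RTFX = {
    "large-v3-turbo": 33.81,
    "distil-large-v3": 48.64,
    "distil-large-v3.5": 49.34,
}


def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: audio duration divided by processing time."""
    return audio_seconds / processing_seconds


def relative_speedup(model: str, baseline: str) -> float:
    """How many times faster `model` is than `baseline`, by average RTFx."""
    return AVG_RTFX[model] / AVG_RTFX[baseline]


def processing_seconds(audio_seconds: float, model: str) -> float:
    """Estimated compute time for a recording, assuming the average RTFx holds."""
    return audio_seconds / AVG_RTFX[model]


if __name__ == "__main__":
    speedup = relative_speedup("distil-large-v3.5", "large-v3-turbo")
    print(f"{speedup:.2f}x")  # 1.46x
    # Rough estimate for one hour of audio on each model.
    for name in AVG_RTFX:
        print(name, f"{processing_seconds(3600, name):.1f} s")
```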
The model is a collaborative effort by Bofeng Huang, Eustache Le Bihan, Steven Zheng, and Vaibhav Srivastav.
best for
- Short-form transcription of audio clips under 30 seconds
- Long-form transcription using sequential or chunked decoding
- Speculative decoding as a draft model paired with Whisper Large V3
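To make the short-form versus chunked long-form distinction concrete, the sketch below computes the sample ranges that a chunked decoder would process for a 16 kHz waveform. The 30-second window matches the short-form limit above. The 5-second overlap is an illustrative assumption, and merging the overlapping transcripts, which a real chunked pipeline must do, is left out.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000  # Hz, the input rate the Whisper feature extractor expects


def chunk_bounds(num_samples: int, chunk_s: float = 30.0, overlap_s: float = 5.0,
                 sample_rate: int = SAMPLE_RATE) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of overlapping fixed-length chunks."""
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap must be shorter than the chunk length")
    if num_samples <= 0:
        return []

    bounds = []
    start = 0
    while True:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:  # the last chunk reaches the end of the audio
            break
        start += step
    return bounds


if __name__ == "__main__":
    # A 10-second clip fits in a single window (short-form).
    print(chunk_bounds(10 * SAMPLE_RATE))
    # A 70-second recording needs several overlapping windows (long-form).
    for start, end in chunk_bounds(70 * SAMPLE_RATE):
        print(f"{start / SAMPLE_RATE:5.1f} s to {end / SAMPLE_RATE:5.1f} s")
```

Sequential decoding instead slides a single window forward based on predicted timestamps, so it has no fixed chunk boundaries to compute up front.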
FAQ
Compared with Whisper Large V3 Turbo, Distil Large V3.5 is ~1.5x faster and slightly more accurate on short-form transcription (7.08 vs. 7.30 OOD WER), but slightly behind on long-form (11.39 vs. 10.25 OOD WER).
Input is an audio file or waveform sampled at 16 kHz, processed via the Whisper feature extractor.
To use the API, send HTTP requests to the OpenAI-compatible endpoint with your API key and audio data.
It supports speculative decoding: it works as a draft model for Whisper Large V3, achieving ~2x faster inference while preserving identical outputs.
A GPU with CUDA is recommended for optimal performance; CPU inference is possible at reduced speed.
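As an illustration of what a request to an OpenAI-compatible transcription endpoint could look like, the sketch below builds a multipart `POST` using only the Python standard library. It assumes the endpoint follows OpenAI's `/audio/transcriptions` convention (form fields `model` and `file`, bearer-token auth). The base URL `https://api.example.com/v1`, the API key, and the use of `distil-whisper/distil-large-v3.5` as the model identifier are placeholders, not confirmed values for any hosted service.

```python
import urllib.request
import uuid


def build_transcription_request(base_url: str, api_key: str, audio_bytes: bytes,
                                filename: str = "audio.wav",
                                model: str = "distil-whisper/distil-large-v3.5",
                                ) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data transcription request."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        # Form field: which model should transcribe the audio.
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="model"\r\n\r\n'
        f'{model}\r\n'.encode(),
        # Form field: the audio file itself.
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'.encode(),
        audio_bytes,
        f'\r\n--{boundary}--\r\n'.encode(),
    ])
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/audio/transcriptions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


if __name__ == "__main__":
    # Placeholder values: substitute the real base URL, key, and audio file.
    request = build_transcription_request(
        "https://api.example.com/v1", "YOUR_API_KEY", b"<16 kHz audio bytes>"
    )
    print(request.get_method(), request.full_url)
    # To actually send it once an endpoint is available:
    # with urllib.request.urlopen(request) as response:
    #     print(response.read().decode())
```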
We're benchmarking and onboarding Distil Large V3.5 as a hosted, OpenAI-compatible API.