Whisper Medium English
openai/whisper-medium.en
published Sep 2022 · updated Jan 2024
Whisper Medium English is an automatic speech recognition model that transcribes English audio into text using a Transformer encoder-decoder architecture trained on 680k hours of weakly supervised data.
specs
| Spec | Value |
| --- | --- |
| Task | Automatic Speech Recognition (ASR) |
| Architecture | Transformer encoder-decoder (sequence-to-sequence) |
| Parameters | 769 million |
| License | MIT |
about this model
Architecture and Capabilities
Whisper medium.en contains 769 million parameters and is optimized for English speech recognition. It processes audio in 30-second segments and supports long-form transcription via chunking, enabling arbitrary-length audio processing. The model can also predict word-level timestamps.
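As a rough illustration of the chunking idea, the sketch below splits a waveform into 30-second windows at 16 kHz. The 5-second overlap and the helper name `chunk_bounds` are assumptions made for this example, not values taken from the model card. A real pipeline would also need to merge the transcripts of overlapping windows, which is omitted here.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000   # the model expects 16 kHz audio
CHUNK_SECONDS = 30.0   # window length the model processes at once


def chunk_bounds(num_samples: int,
                 chunk_s: float = CHUNK_SECONDS,
                 overlap_s: float = 5.0,  # assumed overlap, chosen for illustration
                 sample_rate: int = SAMPLE_RATE) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of overlapping windows covering the audio."""
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap must be shorter than the chunk length")
    bounds: List[Tuple[int, int]] = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last window reached the end of the audio
        start += step
    return bounds


# A 70-second recording is covered by three windows: 0-30 s, 25-55 s, 50-70 s.
windows = chunk_bounds(70 * SAMPLE_RATE)
print([(s / SAMPLE_RATE, e / SAMPLE_RATE) for s, e in windows])
```

Each window can then be transcribed independently, which is also what makes batched inference over a single long file possible.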
Benchmark Performance
On LibriSpeech test-clean, the model achieves a Word Error Rate (WER) of 4.12% (official result) and 3.02% under alternative inference settings. Additional benchmark results include:
- LibriSpeech test-other: 7.43% WER
- AMI: 16.68% WER
- Earnings22: 12.63% WER
- Gigaspeech: 11.03% WER
- Open ASR Leaderboard mean WER: 8.09%
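For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch of that definition; the function name `word_error_rate` is illustrative, and benchmark pipelines typically normalize text (casing, punctuation) before scoring, which is not done here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.2%}")
```

Because the denominator is the reference length, many insertions can push WER above 100%.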
Inference Characteristics
The model requires approximately 5 GB of VRAM and runs roughly twice as fast as the large-v2 variant. It supports batched inference and chunked processing for audio of arbitrary length.
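A back-of-the-envelope calculation shows how much of that memory the weights alone would occupy. This sketch assumes 4 bytes per parameter in fp32 and 2 bytes in fp16. The rest of the stated figure presumably goes to activations, decoding state, and framework overhead, though the source does not break this down.

```python
PARAMS = 769_000_000  # parameter count listed in the specs


def weight_gigabytes(num_params: int, bytes_per_param: int) -> float:
    """Memory taken by the weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


fp32_gb = weight_gigabytes(PARAMS, 4)  # 32-bit floats
fp16_gb = weight_gigabytes(PARAMS, 2)  # 16-bit floats
print(f"fp32 weights: {fp32_gb:.2f} GB")
print(f"fp16 weights: {fp16_gb:.2f} GB")
```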
Training Data and Robustness
Trained on 680,000 hours of internet-sourced audio, the model demonstrates improved robustness to accents, background noise, and technical language compared to many existing ASR systems. It achieves near-state-of-the-art accuracy in a zero-shot transfer setting without fine-tuning.
best for
- English speech transcription with high accuracy
- Transcribing audio files of arbitrary length via the chunked pipeline
- Zero-shot ASR on diverse English accents and domains without fine-tuning
FAQ
The model expects audio as a 16 kHz mono waveform, which is pre-processed into a log-Mel spectrogram. The gigarouter API accepts audio file uploads or raw audio bytes.
The model outputs transcribed English text as a string. It can also return timestamped segments when requested.
Approximately 5 GB of VRAM is required for inference.
The model is released under the MIT license.
Use the gigarouter OpenAI-compatible endpoint with your API key, specifying the model as openai/whisper-medium.en and sending the audio file or bytes in the request.
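A request along those lines might be assembled as in the sketch below, which uses only the Python standard library. The base URL is a hypothetical placeholder. The `/audio/transcriptions` path and the `model` and `file` form fields follow the OpenAI API convention, and it is an assumption that the gigarouter endpoint mirrors them. The helper name `build_transcription_request` is also illustrative.

```python
import urllib.request
import uuid

BASE_URL = "https://api.gigarouter.example/v1"  # hypothetical placeholder URL
MODEL_ID = "openai/whisper-medium.en"


def build_transcription_request(audio_bytes: bytes, filename: str,
                                api_key: str) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data transcription request."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        # Form field naming the model to use.
        (f'--{boundary}\r\n'
         f'Content-Disposition: form-data; name="model"\r\n\r\n'
         f'{MODEL_ID}\r\n').encode(),
        # Form field carrying the audio payload.
        (f'--{boundary}\r\n'
         f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         f'Content-Type: application/octet-stream\r\n\r\n').encode(),
        audio_bytes,
        f'\r\n--{boundary}--\r\n'.encode(),
    ])
    return urllib.request.Request(
        url=f"{BASE_URL}/audio/transcriptions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# Placeholder bytes stand in for a real 16 kHz mono audio file.
request = build_transcription_request(b"<audio bytes>", "meeting.wav", "YOUR_API_KEY")
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it; omitted because the URL is a placeholder.
```

An OpenAI-compatible client library pointed at the same base URL should produce an equivalent request.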
We're benchmarking and onboarding Whisper Medium English as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.