Whisper Large
openai/whisper-large
published Sep 2022 · updated Feb 2024
Whisper Large is an ASR and speech translation model that transcribes and translates multilingual audio using a transformer encoder-decoder trained on 680k hours of weakly supervised data.
specs
| Spec | Value |
|---|---|
| Task | Automatic Speech Recognition and Speech Translation |
| Architecture | Transformer Encoder-Decoder |
| Parameters | 1.55B |
about this model
Key strengths
- Zero-shot generalization – often competitive with prior fully supervised results on standard benchmarks, without dataset-specific fine-tuning.
- Multitask capability – a single model replaces separate components for ASR, translation, language identification, and voice activity detection.
- Long-form support – transcribes audio of arbitrary length using 30-second chunking with batched inference and optional timestamp prediction.
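The 30-second chunking mentioned above can be sketched as follows. This is a minimal illustration, not the hosted implementation: the 16 kHz sample rate is what Whisper operates on, while the function name `chunk_bounds` and the 5-second overlap are assumptions chosen for the example.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000   # Whisper operates on 16 kHz mono audio
CHUNK_SECONDS = 30     # fixed window length the model consumes


def chunk_bounds(num_samples: int, overlap_seconds: float = 5.0) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of 30-second windows covering the audio.

    overlap_seconds is an illustrative default, not a value from the model card.
    """
    chunk = CHUNK_SECONDS * SAMPLE_RATE
    step = chunk - int(overlap_seconds * SAMPLE_RATE)
    if step <= 0:
        raise ValueError("overlap must be shorter than the chunk length")

    bounds: List[Tuple[int, int]] = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last window reached the end of the audio
        start += step
    return bounds


# A 70-second clip is covered by three overlapping windows.
windows = chunk_bounds(70 * SAMPLE_RATE)
```

Overlapping windows give a stitching step some shared context at each boundary, so words cut by a window edge can be reconciled. Each window can then be transcribed independently, which is what makes batched inference possible.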
Benchmark performance (Whisper-large)
| Dataset | Metric | Score |
|---|---|---|
| LibriSpeech clean | WER | 3.0% |
| LibriSpeech other | WER | 5.4% |
| Common Voice 11.0 (Hindi) | WER | 54.8% |
| Open ASR Leaderboard (mean) | WER | 7.94% |
| AMI | WER | 16.73% |
| Earnings22 | WER | 12.91% |
| Gigaspeech | WER | 10.76% |
These results come from the original model card and the Whisper paper. A newer variant, large-v2, improves upon these scores (e.g., LibriSpeech clean WER 2.83%) with the same architecture and is available separately on gigarouter.
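For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below is a plain illustration of that definition. It assumes whitespace tokenization and applies no text normalization, so it will not reproduce the published scores exactly.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.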
Model details
- Parameters: 1,550 million
- Architecture: Transformer encoder-decoder
- Training data: 680k hours of labelled speech
- Languages: multilingual (outputs in 99+ languages)
- Tasks: transcription, translation, language identification
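As a rough guide to what the parameter count means for memory, the arithmetic below estimates the size of the weights alone at a few common precisions. It is a back-of-the-envelope sketch: the precision table and function name are illustrative, and real inference needs additional memory for activations and decoding state.

```python
PARAMS = 1_550_000_000  # parameter count from the model details above

# Bytes needed to store one parameter at each precision (illustrative set).
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Memory occupied by the weights only, in GiB (2**30 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(PARAMS, dtype):.2f} GiB")
```

Halving the precision halves the weight footprint, which is why half-precision loading is a common first step when GPU memory is tight.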
best for
- Transcribing English speech to text
- Translating non-English speech to English text
- Transcribing long audio files via chunking
FAQ
Whisper Large excels at multilingual speech recognition and speech translation without fine-tuning, achieving strong zero-shot performance on many benchmarks.
Whisper Large is a multilingual model; it can transcribe audio in the language spoken or translate it into English.
Whisper Large has 1.55 billion parameters and requires approximately 10 GB of VRAM for inference.
The model expects audio samples preprocessed into log-Mel spectrograms; for long audio, chunks of 30 seconds are used.
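To make the input shape concrete, the sketch below pads or trims raw samples to exactly 30 seconds and computes the resulting spectrogram dimensions. The 16 kHz sample rate, 10 ms hop, and 80 Mel channels are taken from the Whisper paper rather than from this page, and the helper names are illustrative. The Mel filterbank itself is left out.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000   # samples per second
CHUNK_SECONDS = 30     # fixed input window
HOP_LENGTH = 160       # 10 ms stride between spectrogram frames
N_MELS = 80            # Mel channels per frame
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples


def pad_or_trim(samples: List[float], length: int = CHUNK_SAMPLES) -> List[float]:
    """Zero-pad short audio, or cut long audio, to exactly `length` samples."""
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))


def spectrogram_shape(num_samples: int) -> Tuple[int, int]:
    """(Mel channels, frames) for audio of the given length."""
    return (N_MELS, num_samples // HOP_LENGTH)


window = pad_or_trim([0.1] * 1000)        # a very short clip, padded with silence
shape = spectrogram_shape(len(window))    # one 30-second window
```

Every window therefore reaches the encoder with the same shape, whether the original clip was shorter or longer than 30 seconds.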
Use the OpenAI-compatible endpoint with your API key and send the audio data as described in the gigarouter documentation.
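A request to an OpenAI-compatible transcription route might be assembled as sketched below. Several things here are assumptions: the base URL is a placeholder (the real one is in the gigarouter documentation), the model identifier is taken from the slug at the top of this page, and the route and form fields follow OpenAI's `/audio/transcriptions` convention. The function only builds the request and does not send it.

```python
import urllib.request
import uuid

# Hypothetical placeholder; take the real base URL from the gigarouter documentation.
BASE_URL = "https://api.example.invalid/v1"
# Assumed to match the slug shown on this page.
MODEL_ID = "openai/whisper-large"


def build_transcription_request(
    audio_bytes: bytes, filename: str, api_key: str
) -> urllib.request.Request:
    """Build (but do not send) a multipart POST shaped like OpenAI's transcription call."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        # Form field: model
        f"--{boundary}\r\n".encode(),
        b'Content-Disposition: form-data; name="model"\r\n\r\n',
        MODEL_ID.encode() + b"\r\n",
        # Form field: file
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        audio_bytes + b"\r\n",
        # Closing boundary
        f"--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        url=f"{BASE_URL}/audio/transcriptions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


request = build_transcription_request(b"RIFF", "clip.wav", "sk-test")
```

Passing the result to `urllib.request.urlopen` would send it. An OpenAI-compatible client library pointed at the same base URL is usually the more convenient option.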
We're benchmarking and onboarding Whisper Large as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.