Whisper Large V2
openai/whisper-large-v2
published Dec 2022 · updated Feb 2024
Whisper Large V2 is an automatic speech recognition (ASR) model that transcribes and translates multilingual speech using a Transformer encoder-decoder architecture trained on 680k hours of weakly supervised data.
specs
| Spec | Value |
|---|---|
| Task | Automatic Speech Recognition (ASR) and Speech Translation |
| Architecture | Transformer encoder-decoder (sequence-to-sequence) |
| Parameters | 1550M |
| License | MIT |
about this model
Key Strengths
Whisper large-v2 is trained for 2.5x more epochs than the original large model with added regularization, improving robustness. It generalizes to unseen datasets in a zero-shot manner, matching or exceeding prior fully supervised results on standard benchmarks without fine-tuning.
Benchmark Performance
On the Open ASR Leaderboard, large-v2 achieves a mean Word Error Rate (WER) of 7.835 across multiple datasets, with an inverse real-time factor (RTFx) of 144.452, meaning roughly 144 seconds of audio are processed per second of compute. Specific dataset results include:
| Dataset | WER |
|---|---|
| LibriSpeech clean | 2.83 |
| Earnings22 | 12.05 |
| Gigaspeech | 10.67 |
| AMI | 16.74 |
On LibriSpeech test-clean, the model achieves a WER of approximately 3.0% in a zero-shot setting.
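For readers unfamiliar with the two metrics, the following is a minimal sketch of how WER and RTFx are commonly defined. It is not the leaderboard's evaluation code: the function names `word_error_rate` and `rtfx` are illustrative, and the only text normalization applied here is lowercasing, whereas real evaluations typically normalize punctuation and number formats as well.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing (deletion)
                curr[j - 1] + 1,     # extra hypothesis word (insertion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    return audio_seconds / processing_seconds


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

A WER of 2.83 in the table above corresponds to a return value of about 0.0283 from this function, since the table reports percentages.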
Architecture and Capabilities
The model supports multilingual speech recognition across 99 languages, speech translation from any supported language into English, and language identification. It processes audio natively in 30-second segments and supports arbitrary-length transcription through chunking with batched inference and optional timestamp prediction.
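The chunking idea can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual implementation: `chunk_bounds` and its `overlap_seconds` parameter are hypothetical names, the 5-second overlap is an arbitrary choice, and the sketch assumes 16 kHz mono input, which is the sample rate Whisper models expect.

```python
SAMPLE_RATE = 16_000   # samples per second (16 kHz mono input assumed)
WINDOW_SECONDS = 30    # native segment length of the model


def chunk_bounds(num_samples: int, overlap_seconds: float = 5.0) -> list[tuple[int, int]]:
    """Return (start, end) sample indices of 30-second windows covering the audio.

    Consecutive windows overlap by `overlap_seconds` so that words cut at a
    boundary appear whole in at least one window. How the overlapping
    transcripts are merged afterwards is not shown here.
    """
    window = WINDOW_SECONDS * SAMPLE_RATE
    step = window - int(overlap_seconds * SAMPLE_RATE)
    if step <= 0:
        raise ValueError("overlap must be shorter than the window")
    if num_samples <= 0:
        return []

    bounds = []
    start = 0
    while True:
        end = min(start + window, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break
        start += step
    return bounds


# 70 seconds of audio is covered by three overlapping windows.
seventy_seconds = chunk_bounds(70 * SAMPLE_RATE)
```

Because each window is independent, the resulting chunks can be stacked into a batch and decoded together, which is what makes batched inference over long files possible.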
Performance Characteristics
Large-v2 requires approximately 10 GB of VRAM for inference. Its relative inference speed is 1x, the baseline against which the smaller variants are measured; those variants offer faster throughput at reduced accuracy. The model is available through gigarouter as a hosted, OpenAI-compatible API, eliminating the need for local GPU infrastructure or model management.
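As a rough sanity check on the memory figure, the parameter count gives a lower bound for the weights alone. The short calculation below is only arithmetic on the 1550M figure from the specs; the gap between it and the roughly 10 GB recommendation presumably covers activations, decoding state, and framework overhead, but the source does not break the number down, so treat that explanation as an assumption.

```python
PARAMS = 1_550_000_000  # parameter count from the specs table


def weight_gb(params: int, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


fp32_gb = weight_gb(PARAMS, 4)  # 32-bit floats: 4 bytes per parameter
fp16_gb = weight_gb(PARAMS, 2)  # 16-bit floats: 2 bytes per parameter
print(f"fp32 weights: {fp32_gb:.1f} GB, fp16 weights: {fp16_gb:.1f} GB")
```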
best for
- Transcribing multilingual audio with high accuracy in a zero-shot setting
- Translating spoken language (e.g., French to English) without fine-tuning
- Processing long audio files via chunked pipeline for full transcription
FAQ
Whisper Large V2 performs automatic speech recognition (transcription) and speech translation into English across multiple languages.
Whisper Large V2 has 1,550 million (1.55B) parameters.
The model is released under the MIT license.
Approximately 10 GB of VRAM is recommended for running inference locally.
Send audio input to the gigarouter OpenAI-compatible endpoint with your API key; the model returns transcribed or translated text.
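A request to an OpenAI-compatible audio endpoint might be assembled as in the sketch below. Several details are assumptions rather than documented gigarouter behavior: `BASE_URL` is a placeholder and not a real address, the helper `build_audio_request` is a hypothetical name, and using `openai/whisper-large-v2` as the `model` field is a guess based on the identifier shown at the top of this page. The `/audio/transcriptions` and `/audio/translations` routes and the `file` and `model` form fields follow the OpenAI audio API convention. The function only builds the request and does not send it.

```python
import urllib.request
import uuid

BASE_URL = "https://gigarouter.example/v1"  # placeholder, NOT a real endpoint
MODEL_ID = "openai/whisper-large-v2"        # assumed model identifier


def build_audio_request(
    audio: bytes,
    filename: str,
    api_key: str,
    translate: bool = False,
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build (but do not send) a multipart POST carrying an audio file."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        # Form field naming the model to run.
        f'--{boundary}\r\nContent-Disposition: form-data; name="model"'
        f"\r\n\r\n{MODEL_ID}\r\n".encode(),
        # Form field carrying the raw audio bytes.
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n".encode(),
        audio,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    route = "translations" if translate else "transcriptions"
    return urllib.request.Request(
        f"{base_url}/audio/{route}",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
        method="POST",
    )


request = build_audio_request(b"RIFF....", "meeting.wav", api_key="YOUR_API_KEY")
```

Passing the request to `urllib.request.urlopen` would send it. An OpenAI-compatible server typically answers with JSON containing a `text` field, though the exact response shape should be checked against the provider's documentation.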
We're benchmarking and onboarding Whisper Large V2 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.