Whisper Large V2

openai/whisper-large-v2

published Dec 2022 · updated Feb 2024

Whisper Large V2 is an automatic speech recognition (ASR) model that transcribes and translates multilingual speech using a Transformer encoder-decoder architecture trained on 680k hours of weakly supervised data.

status

coming soon

API providers

downloads / mo

115K

license

apache-2.0

specs

Task	Automatic Speech Recognition (ASR) and Speech Translation
Architecture	Transformer encoder-decoder (sequence-to-sequence)
Parameters	1550M
License	MIT

about this model

Whisper large-v2 is an automatic speech recognition (ASR) and speech translation model based on a Transformer encoder-decoder architecture, trained on 680,000 hours of weakly supervised multilingual data. With 1.55 billion parameters, it performs speech recognition, speech translation, and language identification without requiring fine-tuning for new domains. The model takes a sequence-to-sequence approach in which the task is specified through context tokens, so a single model handles multilingual transcription, translation into English, and timestamp prediction. It natively processes audio segments of up to 30 seconds and supports arbitrary-length audio through chunking.
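As a rough illustration of task selection through context tokens, the sketch below assembles the special-token prefix that tells the decoder which language and task to use. The token spellings follow the public Whisper release; the helper name `build_context_tokens` is hypothetical and is not part of any library.

```python
# Minimal sketch: the decoder prefix that selects language, task, and
# timestamp behavior. The helper is illustrative, not a library API.

def build_context_tokens(language: str, task: str = "transcribe",
                         timestamps: bool = False) -> list[str]:
    """Return the special tokens that precede the decoded text."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Without this token the model is expected to emit timestamp tokens.
        tokens.append("<|notimestamps|>")
    return tokens


# French audio, translated into English, no timestamps.
print(build_context_tokens("fr", task="translate"))
```

Switching `task` from "translate" to "transcribe" is the only change needed to move between the two behaviors, which is why one set of weights can serve both.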

Key Strengths

Whisper large-v2 is trained for 2.5x more epochs than the original large model with added regularization, improving robustness. It generalizes to unseen datasets in a zero-shot manner, matching or exceeding prior fully supervised results on standard benchmarks without fine-tuning.

Benchmark Performance

On the Open ASR Leaderboard, large-v2 achieves a mean word error rate (WER) of 7.835% across the leaderboard's datasets, with an inverse real-time factor (RTFx) of 144.452, where higher means faster. Results on individual datasets include:

Dataset	WER (%)
LibriSpeech clean	2.83
Earnings22	12.05
Gigaspeech	10.67
AMI	16.74

On LibriSpeech test-clean, the model achieves a WER of approximately 3.0% in a zero-shot setting.
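For readers unfamiliar with the two metrics, here is a minimal sketch of how they are typically computed. WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, and RTFx is seconds of audio divided by seconds of processing. Both function names are hypothetical, and the sketch skips the text normalization that real evaluations usually apply before scoring.

```python
# Minimal sketch of WER and RTFx. Function names are illustrative only.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def inverse_real_time_factor(audio_seconds: float,
                             processing_seconds: float) -> float:
    """Seconds of audio transcribed per second of compute."""
    return audio_seconds / processing_seconds


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {wer:.2%}")
```

A WER of 2.83 in the table corresponds to a return value of about 0.0283 from `word_error_rate` averaged over a test set.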

Architecture and Capabilities

The model supports multilingual speech recognition across 99 languages, speech translation from any supported language into English, and language identification. It processes audio natively in 30-second segments and supports arbitrary-length transcription through chunking with batched inference and optional timestamp prediction.
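The chunking idea can be sketched as follows: a long recording is cut into windows of at most 30 seconds, each of which fits the model's native input length. The 5-second overlap between neighboring windows and the helper name `chunk_spans` are assumptions made for illustration; the sketch does not cover how transcripts from overlapping windows are merged.

```python
# Minimal sketch: split a long recording into windows of at most 30 seconds.
# The overlap value is an assumption, not a documented setting.

def chunk_spans(total_seconds: float, chunk_seconds: float = 30.0,
                overlap_seconds: float = 5.0) -> list[tuple[float, float]]:
    """Return (start, end) times covering the whole recording."""
    if not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("overlap must be non-negative and shorter than a chunk")
    if total_seconds <= 0:
        return []
    step = chunk_seconds - overlap_seconds
    spans = []
    start = 0.0
    while True:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break
        start += step
    return spans


# A 70-second file becomes three overlapping windows.
for start, end in chunk_spans(70.0):
    print(f"{start:5.1f}s to {end:5.1f}s")
```

Because the windows are independent, they can be stacked into a batch and decoded together, which is where batched inference helps with long files.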

Performance Characteristics

Large-v2 requires approximately 10 GB of VRAM for inference. It serves as the 1x baseline for relative inference speed; smaller Whisper variants offer faster throughput at reduced accuracy. gigarouter is onboarding the model as a hosted, OpenAI-compatible API, which will remove the need for local GPU infrastructure or model management once it is live.
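To put the VRAM figure in context, the arithmetic below estimates the memory taken by the weights alone at two common precisions. It is a back-of-the-envelope sketch: the page does not say which precision the 10 GB figure assumes, and the remainder presumably goes to activations, decoder state, and framework overhead.

```python
# Back-of-the-envelope estimate of weight memory for 1550M parameters.
# Uses decimal gigabytes (1 GB = 1e9 bytes); helper name is illustrative.

NUM_PARAMS = 1_550_000_000  # 1550M, from the specs table


def weight_memory_gb(num_params: int, bytes_per_param: int) -> float:
    """Memory needed to hold the weights only, in GB."""
    return num_params * bytes_per_param / 1e9


fp16_gb = weight_memory_gb(NUM_PARAMS, 2)  # half precision
fp32_gb = weight_memory_gb(NUM_PARAMS, 4)  # single precision

print(f"fp16 weights: {fp16_gb:.1f} GB")
print(f"fp32 weights: {fp32_gb:.1f} GB")
```

Either way the weights fit well under 10 GB, so the quoted requirement leaves headroom for everything else that lives on the GPU during inference.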

best for

· Transcribing multilingual audio with high accuracy in a zero-shot setting
· Translating spoken language into English (e.g., French to English) without fine-tuning
· Processing long audio files through a chunked pipeline for full transcription

FAQ

What is the primary task of Whisper Large V2?

It performs automatic speech recognition (transcription) and speech translation across multiple languages.

How many parameters does the large-v2 model have?

It has 1,550 million (1.55B) parameters.

What is the license for Whisper Large V2?

OpenAI released Whisper's code and model weights under the MIT license. The Hugging Face repository for this checkpoint is tagged apache-2.0.

How much VRAM is required to run Whisper Large V2?

Approximately 10 GB of VRAM is recommended.

How can I use Whisper Large V2 via the gigarouter API?

Once the model is live, send audio to the gigarouter OpenAI-compatible endpoint with your API key; the model returns the transcribed or translated text.
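A request to an OpenAI-compatible transcription endpoint might look like the sketch below. It only builds the request and does not send it. The base URL is a placeholder (the real gigarouter URL is not given here), and the assumption that the model ID matches the listing name `openai/whisper-large-v2` is unconfirmed; the `/audio/transcriptions` path, the `file` and `model` form fields, and the Bearer header follow the OpenAI API convention.

```python
# Minimal sketch: build (but do not send) a multipart transcription request.
# BASE_URL is a placeholder and MODEL_ID is an assumption.
import urllib.request
import uuid

BASE_URL = "https://gigarouter.example/v1"  # hypothetical base URL
MODEL_ID = "openai/whisper-large-v2"        # assumed to match the listing


def build_transcription_request(audio_bytes: bytes, filename: str,
                                api_key: str) -> urllib.request.Request:
    """Assemble a POST request carrying the audio file and model name."""
    boundary = uuid.uuid4().hex
    model_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{MODEL_ID}\r\n"
    ).encode("utf-8")
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    closing = f"\r\n--{boundary}--\r\n".encode("utf-8")
    body = model_part + file_header + audio_bytes + closing
    return urllib.request.Request(
        url=f"{BASE_URL}/audio/transcriptions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


request = build_transcription_request(b"RIFF", "clip.wav", "YOUR_API_KEY")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; with the placeholder URL that call fails, so swap in the real base URL and key when they are available.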

not yet live

We're benchmarking and onboarding Whisper Large V2 as a hosted, OpenAI-compatible API.

related speech-to-text models

speaker-diarization-3.1

wav2vec2-large-xlsr-53-japanese

6.1M dl/mo

wav2vec2-large-xlsr-53-polish

4.7M dl/mo

wav2vec2-large-xlsr-53-dutch

4.1M dl/mo