Whisper Large

openai/whisper-large

published Sep 2022 · updated Feb 2024

Whisper Large is an ASR and speech translation model that transcribes and translates multilingual audio using a transformer encoder-decoder trained on 680k hours of weakly supervised data.

status

coming soon

API providers

downloads / mo

35K

license

apache-2.0

specs

Task	Automatic Speech Recognition and Speech Translation
Architecture	Transformer Encoder-Decoder
Parameters	1.55B

about this model

Whisper Large is an automatic speech recognition (ASR) model that performs multilingual speech recognition, speech translation, and language identification without fine-tuning. It is a Transformer encoder-decoder (sequence-to-sequence) model trained on 680,000 hours of weakly supervised multilingual and multitask data. The model can transcribe audio in the language spoken or translate it into English, and it can handle arbitrary-length audio by splitting it into 30-second chunks.

Key strengths

Zero-shot generalization – matches or exceeds prior fully-supervised results on standard benchmarks without dataset-specific fine-tuning.
Multitask capability – a single model replaces separate components for ASR, translation, language identification, and voice activity detection.
Long-form support – transcribes audio of arbitrary length using 30-second chunking with batched inference and optional timestamp prediction.
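The 30-second chunking mentioned above can be sketched as a simple span calculation. This is only an illustration: the function name `chunk_spans` and the optional overlap parameter are assumptions, not part of the model card, and real pipelines may merge chunk boundaries differently.

```python
from typing import Iterator, Tuple


def chunk_spans(
    total_seconds: float,
    chunk_seconds: float = 30.0,
    overlap_seconds: float = 0.0,
) -> Iterator[Tuple[float, float]]:
    """Yield (start, end) times, in seconds, that cover the whole recording."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    if not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("overlap_seconds must be in [0, chunk_seconds)")

    step = chunk_seconds - overlap_seconds  # how far each new chunk advances
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        yield (start, end)
        if end >= total_seconds:
            break  # the last chunk already reaches the end of the audio
        start += step


# A 75-second file splits into two full chunks and one 15-second remainder.
print(list(chunk_spans(75.0)))  # [(0.0, 30.0), (30.0, 60.0), (60.0, 75.0)]
```

Each span's start time can then be added to any timestamps predicted within that chunk to recover positions in the full recording.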

Benchmark performance (Whisper-large)

Dataset	Metric	Score
LibriSpeech clean	WER	3.0%
LibriSpeech other	WER	5.4%
Common Voice 11.0 (Hindi)	WER	54.8%
Open ASR Leaderboard (mean)	WER	7.94%
AMI	WER	16.73%
Earnings22	WER	12.91%
Gigaspeech	WER	10.76%

These results come from the original model card and the Whisper paper. A newer variant, large-v2, improves upon these scores (e.g., LibriSpeech clean WER 2.83%) with the same architecture and is available separately on gigarouter.
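For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal version that assumes whitespace tokenization and no text normalization; published benchmark scores usually normalize casing and punctuation first, so it will not reproduce the table exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"{wer:.1%}")  # 16.7%
```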

Model details

Parameters: 1.55 billion
Architecture: Transformer encoder-decoder
Training data: 680k hours of labelled speech
Languages: multilingual (outputs in 99+ languages)
Tasks: transcription, translation, language identification

best for

·Transcribing English speech to text
·Translating non-English speech to English text
·Transcribing long audio files via chunking

FAQ

What is Whisper Large best for?

It excels at multilingual speech recognition and speech translation without fine-tuning. Its zero-shot results include 3.0% WER on LibriSpeech clean and 5.4% on LibriSpeech other.

What languages does it support?

Whisper Large is multilingual, with output listed for 99+ languages. It can transcribe speech in the language spoken or translate non-English speech into English text.

How large is the model and what are the VRAM requirements?

It has 1.55 billion parameters and requires approximately 10 GB of VRAM for inference.
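As a rough sanity check on that figure, the memory taken by the weights alone follows from the parameter count and the numeric precision. The calculation below is a back-of-the-envelope sketch: it covers weights only, and the gap up to roughly 10 GB is presumably activations, decoding buffers, and framework overhead, which are not modeled here.

```python
PARAMS = 1_550_000_000  # 1.55 billion parameters, from the specs above


def weight_gigabytes(params: int, bytes_per_param: int) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


fp32_gb = weight_gigabytes(PARAMS, 4)  # 32-bit floats
fp16_gb = weight_gigabytes(PARAMS, 2)  # 16-bit floats

print(f"fp32 weights: {fp32_gb:.1f} GB")  # fp32 weights: 6.2 GB
print(f"fp16 weights: {fp16_gb:.1f} GB")  # fp16 weights: 3.1 GB
```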

What input format does the model expect?

The model expects audio samples preprocessed into log-Mel spectrograms; for long audio, chunks of 30 seconds are used.
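The input size for one chunk can be worked out with a little arithmetic. The constants below (16 kHz sample rate, a hop of 160 samples, 80 mel bins) are not stated on this page; they are assumptions based on the reference Whisper implementation, and `pad_or_trim` here is a plain-Python stand-in rather than the library function.

```python
SAMPLE_RATE = 16_000   # samples per second (assumed)
CHUNK_SECONDS = 30     # chunk length stated above
HOP_LENGTH = 160       # samples between spectrogram frames (assumed)
N_MELS = 80            # mel frequency bins (assumed)

N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples per chunk
N_FRAMES = N_SAMPLES // HOP_LENGTH       # 3,000 frames per chunk


def pad_or_trim(samples: list, length: int = N_SAMPLES) -> list:
    """Zero-pad short audio, or cut long audio, to exactly `length` samples."""
    if len(samples) >= length:
        return list(samples[:length])
    return list(samples) + [0.0] * (length - len(samples))


ten_seconds = [0.0] * (10 * SAMPLE_RATE)
chunk = pad_or_trim(ten_seconds)

print(len(chunk))            # 480000
print((N_MELS, N_FRAMES))    # (80, 3000)
```

Under these assumptions, every chunk reaches the encoder as an 80 by 3,000 log-Mel array regardless of how much speech it actually contains.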

How can I call Whisper Large via the gigarouter API?

Whisper Large is not yet live on gigarouter. Once it is, you will call it through the OpenAI-compatible endpoint with your API key, sending audio data as described in the gigarouter documentation.
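A rough sketch of what such a call might look like, using only the Python standard library. Everything specific here is an assumption: the base URL is a placeholder, the `/audio/transcriptions` route and the `model` and `file` form fields are borrowed from the OpenAI audio API on the premise that "OpenAI-compatible" means the same shape, and the function name is hypothetical. The code only builds the request and does not send it.

```python
import urllib.request
import uuid

# Placeholder base URL (assumption); check the gigarouter documentation.
BASE_URL = "https://api.gigarouter.example/v1"


def build_transcription_request(
    audio_bytes: bytes,
    filename: str,
    api_key: str,
    model: str = "openai/whisper-large",
) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data transcription request."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        # Form field naming the model (field name assumed from the OpenAI API).
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n".encode(),
        # Form field carrying the audio file itself.
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n".encode(),
        audio_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        url=f"{BASE_URL}/audio/transcriptions",  # route assumed, see above
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


request = build_transcription_request(b"\x00\x01", "sample.wav", "YOUR_API_KEY")
print(request.get_method(), request.full_url)
```

Sending the request would be a matter of passing it to `urllib.request.urlopen` once the real endpoint and model identifier are confirmed.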

not yet live

We're benchmarking and onboarding Whisper Large as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related speech-to-text models


speaker-diarization-3.1

wav2vec2-large-xlsr-53-japanese

6.1M dl/mo

wav2vec2-large-xlsr-53-polish

4.7M dl/mo

wav2vec2-large-xlsr-53-dutch

4.1M dl/mo