
Whisper Large

openai/whisper-large

published Sep 2022 · updated Feb 2024

Whisper Large is an ASR and speech translation model that transcribes and translates multilingual audio using a transformer encoder-decoder trained on 680k hours of weakly supervised data.

status: coming soon
API providers: 0
downloads / mo: 35K
license: apache-2.0

specs

Task: Automatic Speech Recognition and Speech Translation
Architecture: Transformer Encoder-Decoder
Parameters: 1.55B

about this model

Whisper-large is an automatic speech recognition (ASR) model that performs multilingual speech recognition, speech translation, and language identification without fine-tuning. It is a Transformer encoder-decoder (sequence-to-sequence) model trained on 680,000 hours of weakly supervised multilingual, multitask data. The model can transcribe audio in its source language or translate it into English, and it handles arbitrary-length audio by processing it in chunks.
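As a rough illustration of how one model can switch between these tasks: the Whisper paper describes conditioning the decoder on a short prefix of special tokens that names the language and the task. The sketch below only assembles that prefix as strings. The helper name `decoder_prompt` is hypothetical, and the token spellings follow the Whisper reference tokenizer rather than anything stated on this page.

```python
def decoder_prompt(language, task="transcribe", timestamps=False):
    """Assemble the special-token prefix that selects language and task.

    Hypothetical helper for illustration; a real tokenizer maps these
    strings to integer ids before they reach the decoder.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Ask the decoder to emit plain text without timestamp tokens.
        tokens.append("<|notimestamps|>")
    return tokens


# French audio, translated into English text.
print(decoder_prompt("fr", task="translate"))
```

Changing only the task token (`<|transcribe|>` versus `<|translate|>`) is what selects same-language transcription or translation for the same audio.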

Key strengths

  • Zero-shot generalization – matches or exceeds prior fully-supervised results on standard benchmarks without dataset-specific fine-tuning.
  • Multitask capability – a single model replaces separate components for ASR, translation, language identification, and voice activity detection.
  • Long-form support – transcribes audio of arbitrary length using 30-second chunking with batched inference and optional timestamp prediction.
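The 30-second chunking mentioned above can be sketched in a few lines. This is a minimal, non-overlapping version that assumes 16 kHz mono samples held in a Python list. The function name `split_into_chunks` is illustrative, and production pipelines typically overlap neighbouring chunks and merge the transcripts, which this sketch does not attempt.

```python
SAMPLE_RATE = 16_000   # Hz; assumed input rate (Whisper models consume 16 kHz audio)
CHUNK_SECONDS = 30     # window length named in the text above


def split_into_chunks(samples, sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
    """Split samples into fixed-length chunks, zero-padding the final one."""
    chunk_len = sample_rate * chunk_seconds
    chunks = []
    for start in range(0, len(samples), chunk_len):
        chunk = list(samples[start:start + chunk_len])
        # Pad the last, shorter chunk with silence so every chunk has equal length.
        chunk.extend([0.0] * (chunk_len - len(chunk)))
        chunks.append(chunk)
    return chunks


# 70 seconds of silence becomes three 30-second chunks (the last one padded).
audio = [0.0] * (70 * SAMPLE_RATE)
chunks = split_into_chunks(audio)
print(len(chunks), len(chunks[0]))  # prints: 3 480000
```

Equal-length chunks are what make batched inference straightforward, since every item in a batch then has the same shape.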

Benchmark performance (Whisper-large)

Dataset | Metric | Score
LibriSpeech clean | WER | 3.0%
LibriSpeech other | WER | 5.4%
Common Voice 11.0 (Hindi) | WER | 54.8%
Open ASR Leaderboard (mean) | WER | 7.94%
AMI | WER | 16.73%
Earnings22 | WER | 12.91%
Gigaspeech | WER | 10.76%

These results come from the original model card and the Whisper paper. A newer variant, large-v2, improves upon these scores (e.g., LibriSpeech clean WER 2.83%) with the same architecture and is available separately on gigarouter.
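For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a plain dynamic-programming version. It applies no text normalization, so it will not reproduce the published scores exactly, and the function name `word_error_rate` is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.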

Model details

  • Parameters: 1,550 million
  • Architecture: Transformer encoder-decoder
  • Training data: 680k hours of labelled speech
  • Languages: multilingual (outputs in 99+ languages)
  • Tasks: transcription, translation, language identification

FAQ

What is Whisper Large best for?

It is best suited to multilingual speech recognition and speech-to-English translation without fine-tuning. Zero-shot, it reaches 3.0% WER on LibriSpeech clean and 5.4% on LibriSpeech other.

What languages does it support?

Whisper Large is multilingual, with output in 99+ languages. It can transcribe speech in the language spoken or translate it into English.

How large is the model and what are the VRAM requirements?

It has 1.55 billion parameters and requires approximately 10 GB of VRAM for inference.
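As a back-of-the-envelope check on that figure, the weights alone account for only part of it. The sketch below multiplies the parameter count by the bytes per parameter. Attributing the remainder of the roughly 10 GB to activations, decoding state, and framework overhead is an assumption here, not something this page states.

```python
PARAMS = 1.55e9  # parameter count from the specs above
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gb(num_params, dtype):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(PARAMS, dtype):.1f} GB")
# prints:
# float32: 6.2 GB
# float16: 3.1 GB
```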

What input format does the model expect?

The model expects audio samples preprocessed into log-Mel spectrograms; for long audio, chunks of 30 seconds are used.
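To make the input format concrete, the arithmetic below works out the spectrogram shape for one 30-second chunk. The 16 kHz sample rate, 10 ms hop (160 samples), and 80 Mel bins are taken from the Whisper reference implementation for this checkpoint rather than from this page, so treat them as assumptions to verify against the preprocessing library you use.

```python
SAMPLE_RATE = 16_000  # Hz (assumed, per the Whisper reference implementation)
CHUNK_SECONDS = 30    # chunk length stated above
HOP_LENGTH = 160      # samples between frames, i.e. 10 ms at 16 kHz (assumed)
N_MELS = 80           # Mel frequency bins (assumed for this checkpoint)


def spectrogram_shape(sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS,
                      hop_length=HOP_LENGTH, n_mels=N_MELS):
    """Return (mel_bins, frames) for one fixed-length chunk."""
    n_samples = sample_rate * chunk_seconds
    n_frames = n_samples // hop_length
    return (n_mels, n_frames)


print(spectrogram_shape())  # prints: (80, 3000)
```

Audio shorter than 30 seconds is padded to the full window before conversion, so the encoder always sees the same shape.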

How can I call Whisper Large via the gigarouter API?

Whisper Large is not yet live on gigarouter. Once it launches, call the OpenAI-compatible endpoint with your API key and send the audio as described in the gigarouter documentation.
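Because the model is not yet hosted, the following is only a sketch of what an OpenAI-style transcription request usually looks like: a multipart POST to an `/audio/transcriptions` path with `model` and `file` fields. The base URL is a placeholder, the helper name `build_transcription_request` is hypothetical, and using `openai/whisper-large` as the model id is an assumption. The code builds the request without sending it.

```python
import urllib.request
import uuid

# Placeholder only: take the real base URL from the gigarouter documentation.
BASE_URL = "https://api.example.invalid/v1"


def build_transcription_request(api_key, audio_bytes, filename="audio.wav",
                                model="openai/whisper-large", base_url=BASE_URL):
    """Build (but do not send) a multipart POST in the OpenAI transcription style."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/audio/transcriptions",
        data=head + audio_bytes + tail,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


request = build_transcription_request("YOUR_API_KEY", b"RIFF....")
print(request.get_method(), request.full_url)
```

Once the endpoint exists, passing the request to `urllib.request.urlopen` would send it. An OpenAI-compatible client library pointed at the documented base URL would be the more usual route.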

not yet live

We're benchmarking and onboarding Whisper Large as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
