whisper base.en

openai/whisper-base.en

published Sep 2022 · updated Jan 2024

An open speech-to-text model with 30.8K downloads a month. gigarouter is benchmarking it and onboarding it as a hosted, OpenAI-compatible API.

status

coming soon

API providers

downloads / mo

30.8K

license

apache-2.0

about this model

openai/whisper-base.en is an automatic speech recognition (ASR) model that transcribes English audio into text. It is a Transformer sequence-to-sequence (encoder-decoder) model with 74 million parameters, trained on 680,000 hours of weakly supervised speech data. As an English-only variant of the Whisper family, it is optimized for English speech recognition and generally outperforms the multilingual base model on English tasks.

Key Capabilities

Robust to accents, background noise, and technical language without requiring fine-tuning.
Zero-shot generalization across multiple domains; competitive with prior fully supervised results.
Supports audio segments up to 30 seconds natively; arbitrary-length transcription via chunking with timestamp prediction.
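
As a rough illustration of the chunking idea above, the sketch below splits a long recording into overlapping windows of at most 30 seconds. The function name `chunk_windows` and the 5-second overlap are assumptions made for this example, not values taken from the model itself; production pipelines (for instance the Hugging Face Transformers ASR pipeline with its `chunk_length_s` option) also merge the overlapping transcripts, which is not shown here.

```python
from typing import List, Tuple


def chunk_windows(
    duration_s: float,
    chunk_s: float = 30.0,   # native window length of the model
    overlap_s: float = 5.0,  # assumed overlap, chosen only for illustration
) -> List[Tuple[float, float]]:
    """Return (start, end) windows in seconds that together cover the audio."""
    if duration_s <= 0:
        return []
    if not 0 <= overlap_s < chunk_s:
        raise ValueError("overlap_s must be in [0, chunk_s)")

    step = chunk_s - overlap_s
    windows: List[Tuple[float, float]] = []
    start = 0.0
    while True:
        end = min(start + chunk_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break
        start += step
    return windows


if __name__ == "__main__":
    # A 70-second clip yields three overlapping windows.
    print(chunk_windows(70.0))  # [(0.0, 30.0), (25.0, 55.0), (50.0, 70.0)]
```

Each window would be transcribed separately, and the predicted timestamps can then be used to stitch the pieces back together where the windows overlap.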

Performance Benchmarks

On the LibriSpeech test-clean dataset, the model achieves a word error rate (WER) of 4.27%. Inference requires approximately 1 GB of VRAM and runs at roughly 7× the speed of the large model variant.
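
For readers unfamiliar with the metric, WER is the number of word substitutions, deletions, and insertions needed to turn the model output into the reference transcript, divided by the number of reference words. The sketch below is a minimal version that only lowercases and splits on whitespace; published figures such as the one above typically apply a fuller text normalization first, so this helper (`word_error_rate` is a name chosen for this example) will not reproduce them exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 / 6.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

At a WER of 4.27%, a transcript contains on average a little over four word errors per 100 reference words.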

Training Data

The model was trained on 680,000 hours of audio and transcripts: roughly 65% English audio with English transcripts (438,000 hours), roughly 18% non-English audio with English transcripts (126,000 hours), and roughly 17% non-English audio with matching non-English transcripts (117,000 hours across 98 languages).

Limitations

As a weakly supervised model, it may produce hallucinated text not present in the audio. Performance varies across languages, accents, and demographic groups. Repetitive text generation can occur but is partially mitigated by beam search and temperature scheduling.
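
One way to picture the temperature scheduling mentioned above: decode greedily first, check whether the output looks degenerate, and retry at a higher temperature if it does. The sketch below uses a compression ratio as the repetition check, since highly repetitive text compresses unusually well. The `decode` callable is a hypothetical stand-in for a real model call, and the 2.4 threshold and temperature ladder are assumptions for illustration rather than documented settings of this model.

```python
import zlib
from typing import Callable, Sequence


def compression_ratio(text: str) -> float:
    """Raw size over compressed size; repetitive text scores high."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def decode_with_fallback(
    decode: Callable[[float], str],  # hypothetical: temperature -> transcript
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    max_ratio: float = 2.4,          # assumed threshold for "too repetitive"
) -> str:
    """Retry decoding at rising temperatures until the text is not repetitive."""
    text = ""
    for temperature in temperatures:
        text = decode(temperature)
        if compression_ratio(text) <= max_ratio:
            break  # output looks acceptable; stop retrying
    return text


if __name__ == "__main__":
    def fake_decode(temperature: float) -> str:
        # Simulates a model that loops at temperature 0 and recovers above it.
        if temperature == 0.0:
            return "la " * 50
        return "hello there, this is a normal sentence"

    print(decode_with_fallback(fake_decode))
```

If every temperature produces repetitive text, the function returns the last attempt, so a caller may still want to flag such segments for review.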

not yet live

We're benchmarking and onboarding whisper base.en as a hosted, OpenAI-compatible API.

related speech-to-text models

speaker-diarization-3.1

wav2vec2-large-xlsr-53-japanese · 6.1M dl/mo

wav2vec2-large-xlsr-53-polish · 4.7M dl/mo

wav2vec2-large-xlsr-53-dutch · 4.1M dl/mo