Hosted speech-to-text models
30 models · 0 live as APIs · benchmarked & compared
Speech-to-text models convert spoken audio into written text, enabling applications such as real-time captioning, meeting transcription, voice-controlled interfaces, and automated subtitling. Speaker diarization models—such as pyannote/speaker-diarization-3.1—extend this by identifying who spoke when, which is critical for multi-speaker recordings like conference calls or interviews.
In production, these models are typically deployed in pipelines that include voice activity detection, language identification, and post-processing for punctuation and formatting. The choice among models involves a trade-off between transcription accuracy, latency, and computational cost. For example, openai/whisper-base offers a fast, compact option, while larger variants or specialized models like jonatasgrosman/wav2vec2-large-xlsr-53-japanese are tuned for specific languages or higher accuracy at the expense of speed and memory.
This page lists 30 speech-to-text models (0 currently live, the remainder being onboarded), including pyannote/speaker-diarization-3.1, argmaxinc/whisperkit-coreml, openai/whisper-base, and several wav2vec2 variants. Calling a