skip to content
gigarouter gigarouter
models / audio text to text · coming soon

VibeVoice ASR

microsoft/VibeVoice-ASR-HF

published Mar 2026 · updated Mar 2026

VibeVoice ASR is a unified speech-to-text model that transcribes up to 60 minutes of long-form audio in a single pass, generating structured output with speaker diarization, timestamps, and content in over 50 languages.

status
coming soon
API providers
0
downloads / mo
644.8K
license
mit

specs

TaskAutomatic Speech Recognition (ASR) with speaker diarization and timestamping
ArchitectureVibeVoice-ASR-7B
Parameters7 billion
LicenseNot specified in the model card

about this model

VibeVoice-ASR is a unified speech-to-text model that processes up to 60 minutes of long-form audio in a single pass and generates structured transcriptions identifying who spoke what and when, with support for over 50 languages, code-switching, and customized hotwords.

Key Capabilities

  • Single-pass processing of 60-minute audio — maintains global context and consistent speaker tracking across the full duration.
  • Rich transcription output — jointly performs automatic speech recognition (ASR), speaker diarization, and timestamping in one end-to-end generation step.
  • Customized hotwords — users can supply domain-specific terms or names (e.g., proper nouns, technical jargon) to guide recognition and improve accuracy.
  • Multilingual and code-switching support — natively handles over 50 languages without requiring explicit language setting; switches seamlessly within and across utterances.

Benchmark Performance

The model (VibeVoice-ASR-7B, 7 billion parameters) uses a continuous speech tokenizer operating at a 7.5 Hz frame rate. Representative results on public datasets:

DatasetLanguageDER (%)WER (%)
MLC-Challenge (average)Multilingual3.4212.07
AISHELL-4Chinese21.40
AMI-IHMEnglish18.81

DER = diarization error rate, WER = word error rate. Full results are in the technical report.

VibeVoice-ASR architecture overview

Additional Capabilities

  • Prompt-based context injection — provide textual context (e.g., a topic summary) to improve accuracy on domain-specific terminology and polyphonic character disambiguation.
  • Batch inference — process multiple audio files concurrently.
  • vLLM inference support — enables faster decoding when deployed with vLLM.
  • Fine-tuning — LoRA fine-tuning code is publicly available for adapting the model to custom domains.

best for

FAQ

What is the maximum audio length VibeVoice ASR can handle in one pass?

It can process up to 60 minutes of continuous audio within a 64K token length.

Does VibeVoice ASR support speaker diarization?

Yes, it jointly performs ASR, speaker diarization, and timestamping, outputting who said what and when.

How many languages does VibeVoice ASR support?

It supports over 50 languages and handles code-switching natively without requiring an explicit language setting.

How can I use custom hotwords or context with this model?

You can pass a prompt string to the apply_transcription_request method to guide recognition on domain-specific terms.

How do I call VibeVoice ASR via the gigarouter API?

Use the gigarouter OpenAI-compatible endpoint with your API key to send audio and receive structured transcriptions.

not yet live

We're benchmarking and onboarding VibeVoice ASR as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.