Canary 1B Flash
nvidia/canary-1b-flash
published Mar 2025 · updated Jun 2026
Canary 1B Flash is a multilingual multitasking speech model that performs automatic speech recognition (ASR) in English, German, French, and Spanish, and speech-to-text translation between English and German, French, or Spanish, with optional punctuation/capitalization and word-level or segment-level timestamps.
specs
| Spec | Value |
|---|---|
| Task | ASR and AST (Automatic Speech Translation) |
| Architecture | FastConformer encoder with 32 layers + Transformer decoder with 4 layers |
| Parameters | 883 million |
| License | CC-BY-4.0 |
about this model
nvidia/canary-1b-flash is a speech model that transcribes and translates speech across four languages. With 883 million parameters, it supports ASR in English, German, French, and Spanish, and speech-to-text translation from English into German, French, or Spanish, and from those languages into English. Output can include punctuation and capitalization (PnC) and, as an experimental feature, word-level or segment-level timestamps. On the OpenASR leaderboard benchmark, run on an NVIDIA A100 GPU, the model exceeds 1,000 RTFx (inverse real-time factor: seconds of audio processed per second of compute).
Architecture and Training
Canary-1b-flash uses an encoder-decoder architecture with a FastConformer encoder and a Transformer decoder. Task tokens (for example, the target language, the task type, and whether to emit timestamps) are fed into the decoder to control generation. The model was trained on 85,000 hours of speech data: 31,000 hours of public datasets, 20,000 hours from Suno, and 34,000 hours of in-house data. The public data includes English, German, French, and Spanish speech from sources such as LibriSpeech, Multilingual LibriSpeech, Common Voice, VoxPopuli, and others.
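To illustrate how the task controls described above might be expressed per utterance, here is a minimal sketch that builds one JSON-lines request record and rejects language pairs the model does not cover. The helper `manifest_entry` is hypothetical, and the field names (`audio_filepath`, `source_lang`, `target_lang`, `pnc`, `timestamp`) are assumptions modelled on NeMo-style manifests; check the NeMo documentation before relying on them.

```python
import json

# Languages covered by the model, per the description above.
SUPPORTED_LANGS = {"en", "de", "fr", "es"}


def manifest_entry(audio_filepath, source_lang, target_lang,
                   pnc=True, timestamp=False):
    """Build one request record (hypothetical helper; field names are assumptions)."""
    if source_lang not in SUPPORTED_LANGS or target_lang not in SUPPORTED_LANGS:
        raise ValueError("unsupported language")
    # ASR means source == target. Translation is only between English and
    # German, French, or Spanish, so one side of the pair must be English.
    if source_lang != target_lang and "en" not in (source_lang, target_lang):
        raise ValueError("translation must be to or from English")
    return {
        "audio_filepath": audio_filepath,
        "source_lang": source_lang,
        "target_lang": target_lang,
        "pnc": "yes" if pnc else "no",
        "timestamp": "yes" if timestamp else "no",
    }


# One ASR record and one English-to-German translation record, as JSON lines.
records = [
    manifest_entry("sample_en.wav", "en", "en", timestamp=True),
    manifest_entry("sample_en.wav", "en", "de"),
]
for record in records:
    print(json.dumps(record))
```

Validating the pair before building the record keeps unsupported requests, such as German to French, from reaching the model at all.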
Benchmark Performance
Word error rate (WER) on the Hugging Face OpenASR leaderboard (without PnC, using greedy decoding, text normalized with whisper-normalizer):
| Dataset | WER (%) |
|---|---|
| AMI | 13.11 |
| GigaSpeech | 9.85 |
| LibriSpeech Clean | 1.48 |
| LibriSpeech Other | 2.87 |
| Earnings22 | 12.79 |
| SPGISpeech | 1.95 |
| Tedlium | 3.12 |
| VoxPopuli | 5.63 |
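For readers unfamiliar with the metric, the sketch below shows the usual definition of WER: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It is an illustration only, not the leaderboard's evaluation code; text normalization is omitted and `word_error_rate` is a hypothetical name. The last lines average the eight figures from the table.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent using word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# WER figures copied from the table above.
TABLE_WER = {
    "AMI": 13.11, "GigaSpeech": 9.85, "LibriSpeech Clean": 1.48,
    "LibriSpeech Other": 2.87, "Earnings22": 12.79, "SPGISpeech": 1.95,
    "Tedlium": 3.12, "VoxPopuli": 5.63,
}
mean_wer = sum(TABLE_WER.values()) / len(TABLE_WER)

print(f"{word_error_rate('the cat sat on the mat', 'the cat sat on mat'):.2f}")
print(f"mean WER over the table: {mean_wer:.2f}")  # about 6.35
```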
Inference speed on an NVIDIA A100 (batch size 128) is 1,045.75 RTFx.
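To put that figure in perspective, RTFx is the ratio of audio duration to processing time, so dividing an audio length by the RTFx gives the implied processing time. The short sketch below does that arithmetic for the reported number; the function names are illustrative, and real throughput depends on hardware, batch size, and audio length.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    return audio_seconds / processing_seconds


def processing_time(audio_seconds: float, rtfx_value: float) -> float:
    """Seconds of compute implied by a given RTFx for a given amount of audio."""
    return audio_seconds / rtfx_value


REPORTED_RTFX = 1045.75  # A100, batch size 128, as stated above

one_hour = 3600.0
seconds_needed = processing_time(one_hour, REPORTED_RTFX)
print(f"One hour of audio: about {seconds_needed:.2f} s of compute")  # about 3.44 s
```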
Licensing
This model is released under the CC-BY-4.0 license and is available for commercial use.
best for
- Transcribing English, German, French, and Spanish audio with punctuation
- Translating speech from English to German, French, or Spanish and vice versa
- Generating word-level and segment-level timestamps for audio in supported languages
FAQ
Canary 1B Flash supports English, German, French, and Spanish for ASR, and translation between English and each of the other three languages.
Canary 1B Flash has 883M parameters and achieves over 1,000 RTFx inference speed on OpenASR benchmarks.
Canary 1B Flash can produce word-level and segment-level timestamps for audio in English, German, French, and Spanish (experimental).
Canary 1B Flash is released under CC-BY-4.0, which allows commercial use.
Use the gigarouter OpenAI-compatible endpoint with your API key, specifying the model name "canary-1b-flash".
We're benchmarking and onboarding Canary 1B Flash as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.