Parakeet CTC 0.6B
nvidia/parakeet-ctc-0.6b
published Dec 2023 · updated Sep 2025
Parakeet CTC 0.6B is an automatic speech recognition model built on the FastConformer CTC architecture. It transcribes English speech into lowercase text.
specs
| Task | Automatic Speech Recognition (ASR) |
| Architecture | FastConformer CTC (8x downsampling) |
| Parameters | 0.6B (600 million) |
| License | CC-BY-4.0 |
about this model
Parakeet CTC 0.6B is an automatic speech recognition (ASR) model that transcribes English speech into lowercase text. It is based on the FastConformer CTC architecture and has approximately 600 million parameters. The model was jointly developed by NVIDIA NeMo and Suno.ai.
Architecture and Training
FastConformer is an optimized variant of the Conformer model. It achieves 2.8x faster inference than the original Conformer and scales to billion-parameter models without core architectural changes. The model was trained with CTC loss on 64,000 hours of English speech: 40,000 hours of private data and 24,000 hours of public datasets, including LibriSpeech, Fisher Corpus, Switchboard, WSJ, VCTK, VoxPopuli, Europarl-ASR, Multilingual LibriSpeech (2,000-hour subset), Mozilla Common Voice (v7.0), and People’s Speech (12,000-hour subset).
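A model trained with CTC loss emits one label per encoder frame, and those frame labels have to be collapsed into text. The sketch below illustrates the standard greedy CTC rule (merge consecutive repeats, then drop blanks) on per-frame labels that are assumed to be already chosen by argmax. The blank index and the toy character vocabulary are assumptions for illustration; they are not this model's tokenizer or configuration.

```python
from itertools import groupby

BLANK_ID = 0  # hypothetical blank index; a real model defines its own


def ctc_greedy_decode(frame_ids, id_to_token, blank_id=BLANK_ID):
    """Collapse consecutive repeated labels, then remove blank labels."""
    # groupby yields one key per run of identical consecutive labels.
    collapsed = [label for label, _ in groupby(frame_ids)]
    return "".join(id_to_token[i] for i in collapsed if i != blank_id)


# Toy vocabulary for illustration only.
toy_vocab = {1: "h", 2: "e", 3: "l", 4: "o"}

# The blank between the two runs of 3 is what keeps the double "l".
frames = [1, 1, 0, 2, 0, 3, 3, 0, 3, 4, 4, 0]
print(ctc_greedy_decode(frames, toy_vocab))  # hello
```

Note that a repeated token survives only when a blank separates its two runs, which is why the order of operations (collapse first, then drop blanks) matters.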
After training, global attention can be replaced with limited-context attention, which lets the model transcribe long-form speech up to 11 hours in duration. For further details, refer to the Fast Conformer paper (ASRU 2023, arXiv:2305.05084).
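To give a sense of scale for the 11-hour figure, here is a rough back-of-the-envelope sketch. It assumes a 10 ms feature hop ahead of the 8x downsampling (the hop is not stated on this page) and a hypothetical local window of 128 frames on each side. Neither value is confirmed here as the model's actual configuration.

```python
HOP_MS = 10          # assumed feature hop in milliseconds (not from this page)
DOWNSAMPLING = 8     # from the specs table
LOCAL_WINDOW = 128   # hypothetical context frames on each side


def encoder_frames(duration_seconds):
    """Approximate encoder frame count for a clip of the given length."""
    return int(duration_seconds * 1000) // (HOP_MS * DOWNSAMPLING)


def attention_pairs(n_frames, window=None):
    """Query-key pairs scored per attention layer (upper bound when local)."""
    if window is None:
        return n_frames * n_frames              # global: every frame sees every frame
    return n_frames * min(n_frames, 2 * window + 1)  # local: fixed-size neighborhood


n = encoder_frames(11 * 3600)
print(n)                                 # 495000
print(attention_pairs(n))                # global attention, grows quadratically
print(attention_pairs(n, LOCAL_WINDOW))  # limited context, grows linearly
```

Under these assumptions the global variant scores roughly three orders of magnitude more pairs than the limited-context variant, which is the intuition behind swapping the attention type for long recordings.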
Performance Benchmarks
Word Error Rate (WER, in %; lower is better) with greedy decoding (no external language model) on standard benchmarks:
| AMI | Earnings-22 | GigaSpeech | LS test-clean | SPGISpeech | TED-LIUM v3 | VoxPopuli | Common Voice |
|---|---|---|---|---|---|---|---|
| 16.30 | 14.14 | 10.35 | 1.87 | 3.76 | 4.11 | 3.78 | 7.00 |
Benchmarks are from the Hugging Face ASR Leaderboard. The model uses a SentencePiece Unigram tokenizer with vocabulary size 1024.
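For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch of that definition. It is not the leaderboard's scoring code, and it skips the text normalization that benchmark pipelines usually apply before scoring.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{100 * wer:.2f}")  # 33.33
```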
License: CC-BY-4.0.
best for
- Transcribing recorded meetings and earnings calls
- Real-time speech-to-text for voice commands and dictation
- Large-scale batch transcription of audio libraries
FAQ
It excels at transcribing English speech in various domains, including meetings, lectures, and voice assistants, with high accuracy.
It has 0.6B parameters, and its FastConformer architecture provides 2.8x faster inference than the original Conformer.
It is licensed under CC-BY-4.0, allowing use with attribution.
It accepts 16 kHz (16,000 Hz) mono WAV audio and outputs transcribed lowercase English text.
Use the OpenAI-compatible endpoint with your API key; refer to gigarouter documentation for exact details.
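Since the expected input is 16,000 Hz mono WAV, it can help to check a file before sending it anywhere. The sketch below uses only Python's standard `wave` module. The helper names `check_wav` and `write_silence` are hypothetical, and the demo writes silent test files to a temporary directory rather than using real recordings.

```python
import os
import tempfile
import wave

EXPECTED_RATE = 16000    # Hz, from the input format described above
EXPECTED_CHANNELS = 1    # mono


def check_wav(path):
    """Return a list of format mismatches; an empty list means the file fits."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected {EXPECTED_RATE}")
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"{wav.getnchannels()} channels, expected {EXPECTED_CHANNELS}")
    return problems


def write_silence(path, rate, channels, seconds=1):
    """Write a silent 16-bit PCM WAV file, used here only as demo input."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * rate * channels * seconds)


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.wav")
    bad = os.path.join(tmp, "bad.wav")
    write_silence(good, 16000, 1)
    write_silence(bad, 44100, 2)
    good_report = check_wav(good)
    bad_report = check_wav(bad)

print(good_report)      # []
print(len(bad_report))  # 2
```

A file that fails the check would need resampling or downmixing with an audio tool of your choice before transcription.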
We're benchmarking and onboarding Parakeet CTC 0.6B as a hosted, OpenAI-compatible API.