Parakeet CTC 1.1B
nvidia/parakeet-ctc-1.1b
published Dec 2023 · updated Sep 2025
Parakeet CTC 1.1B is an automatic speech recognition model that transcribes English speech into lower-case text using a FastConformer-CTC architecture with 1.1 billion parameters.
specs
| Spec | Value |
|---|---|
| Task | Automatic Speech Recognition (ASR) |
| Architecture | FastConformer-CTC |
| Parameters | 1.1B |
| License | CC-BY-4.0 |
about this model
Parakeet CTC 1.1B is an automatic speech recognition (ASR) model that transcribes English speech into lower-case text. It is an XXL version of the FastConformer CTC architecture with approximately 1.1 billion parameters, jointly developed by NVIDIA NeMo and Suno.ai.
The model is built on the Fast Conformer architecture, which is 2.8x faster than the original Conformer and scales to billion-parameter models. It is trained with CTC loss and uses a SentencePiece Unigram tokenizer with a vocabulary size of 1024. The model can transcribe long-form speech of up to 11 hours by switching, after training, to limited-context attention with a global token. The paper describing the architecture was accepted at ASRU 2023.
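To make the CTC step concrete, here is a minimal sketch of greedy CTC decoding: pick the best-scoring token in each frame, merge consecutive repeats, then drop the blank symbol. The function name `ctc_greedy_decode`, the three-symbol toy vocabulary, and the choice of ID 0 as the blank are assumptions for illustration only; they are not the model's tokenizer or the NeMo API.

```python
from typing import List, Sequence


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]], blank_id: int) -> List[int]:
    """Return token IDs after greedy CTC decoding."""
    decoded: List[int] = []
    previous = None
    for scores in frame_scores:
        # Pick the highest-scoring token for this frame.
        best = max(range(len(scores)), key=scores.__getitem__)
        # Emit only when the token changes and is not the blank symbol.
        if best != previous and best != blank_id:
            decoded.append(best)
        previous = best
    return decoded


# Toy example: 3-symbol vocabulary where ID 0 is the blank (an assumption).
toy_vocab = {1: "a", 2: "b"}
toy_scores = [
    [0.1, 0.8, 0.1],    # "a"
    [0.1, 0.7, 0.2],    # "a" again: a repeat, so it is merged
    [0.9, 0.05, 0.05],  # blank
    [0.2, 0.7, 0.1],    # "a": kept, because a blank separated it from the last "a"
    [0.1, 0.1, 0.8],    # "b"
]
token_ids = ctc_greedy_decode(toy_scores, blank_id=0)
text = "".join(toy_vocab[i] for i in token_ids)
print(text)
```

In a real system the token IDs would be mapped back to text by the SentencePiece tokenizer rather than a lookup table like `toy_vocab`.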
Training Data
The model was trained on 64,000 hours of English speech, comprising 40,000 hours of private data and 24,000 hours from public datasets including LibriSpeech, Fisher Corpus, Switchboard-1, WSJ, VCTK, VoxPopuli, Europarl-ASR, Multilingual LibriSpeech (MLS EN), Mozilla Common Voice (v7.0), and People's Speech.
Performance
Word Error Rate (WER%) with greedy decoding (no external language model) on standard benchmarks:
| Benchmark | WER (%) |
|---|---|
| AMI | 15.62 |
| Earnings-22 | 13.69 |
| GigaSpeech | 10.27 |
| LibriSpeech test-clean | 1.83 |
| SPGISpeech | 3.54 |
| TEDLIUM-v3 | 4.20 |
| VoxPopuli | 3.54 |
| Common Voice | 6.53 |
Additional benchmark results are available on the Hugging Face Open ASR Leaderboard.
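For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal illustration with a hypothetical helper name, `word_error_rate`; it is not the evaluation script behind the table, and it skips the text normalization that benchmark evaluations typically apply before scoring.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edit distance between ref[:i-1] and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current[j] = min(substitution, deletion, insertion)
        previous = current
    return previous[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {100 * wer:.2f}%")
```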
Key Capabilities
- Accepts 16 kHz mono-channel audio input
- Supports transcription of long-form audio up to 11 hours
- Fast Conformer architecture is 2.8x faster than the original Conformer
- Licensed under CC-BY-4.0
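Because the model expects 16 kHz mono audio, it can help to check a file's header before submitting it. The following is a minimal sketch using only Python's standard `wave` module; the helper names `check_wav_format` and `make_silent_wav` are hypothetical, and the silent in-memory files exist only to demonstrate the check.

```python
import io
import wave


def check_wav_format(wav_file, expected_rate: int = 16000, expected_channels: int = 1) -> list:
    """Return a list of problems; an empty list means the header matches."""
    problems = []
    with wave.open(wav_file, "rb") as wav:
        if wav.getframerate() != expected_rate:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected {expected_rate} Hz")
        if wav.getnchannels() != expected_channels:
            problems.append(f"{wav.getnchannels()} channels, expected {expected_channels}")
    return problems


def make_silent_wav(rate: int, channels: int, seconds: float = 0.1) -> io.BytesIO:
    """Build an in-memory 16-bit PCM WAV of silence, used here only for the demo."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * int(rate * seconds))
    buffer.seek(0)
    return buffer


ok = check_wav_format(make_silent_wav(16000, 1))   # matches the expected format
bad = check_wav_format(make_silent_wav(44100, 2))  # wrong rate and channel count
print(ok, bad)
```

`check_wav_format` also accepts a file path, since `wave.open` takes either a path or a file-like object. Audio that fails the check would need to be resampled or downmixed with a separate tool.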
best for
- Transcribing English speech from audio files or streams
- Processing long-form audio up to 11 hours
- High-accuracy transcription in production ASR pipelines
FAQ
It accepts 16 kHz mono-channel WAV audio as input.
It outputs transcribed speech as a lowercase English string.
The FastConformer architecture is 2.8x faster than the original Conformer.
It is licensed under CC-BY-4.0.
Use the gigarouter OpenAI-compatible endpoint with your API key to send audio and receive transcriptions.
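As a rough illustration of what such a call could look like, the sketch below builds (but does not send) a multipart request in the shape used by OpenAI-style transcription endpoints. Everything specific here is an assumption: the base URL `https://api.example.com/v1` is a placeholder, the `/audio/transcriptions` path and the `model` and `file` form fields follow the OpenAI convention rather than any documented gigarouter route, and the model ID is taken from the identifier shown at the top of this page.

```python
import urllib.request
import uuid

# Placeholder values: the real base URL is not given on this page.
BASE_URL = "https://api.example.com/v1"  # hypothetical
MODEL_ID = "nvidia/parakeet-ctc-1.1b"    # assumed to match the hosted model ID


def build_transcription_request(audio_bytes: bytes, filename: str, api_key: str) -> urllib.request.Request:
    """Build a multipart/form-data POST carrying the model name and a WAV file."""
    boundary = uuid.uuid4().hex
    model_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{MODEL_ID}\r\n"
    ).encode()
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    ).encode()
    closing = f"--{boundary}--\r\n".encode()
    body = model_part + file_header + audio_bytes + b"\r\n" + closing
    return urllib.request.Request(
        f"{BASE_URL}/audio/transcriptions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
        method="POST",
    )


# Dummy bytes stand in for real 16 kHz mono WAV data.
request = build_transcription_request(b"RIFF....WAVE", "sample.wav", "YOUR_API_KEY")
print(request.get_method(), request.full_url)
```

Sending the request would be a matter of passing it to `urllib.request.urlopen` once the real base URL and an API key are known; the response format is not described on this page.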
We're benchmarking and onboarding Parakeet CTC 1.1B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.