Parakeet RNNT 0.6B

nvidia/parakeet-rnnt-0.6b

published Dec 2023 · updated Jun 2026

Parakeet RNNT 0.6B is an ASR model that transcribes English speech into lower-case text.

status

coming soon

downloads / mo

36.6K

license

cc-by-4.0

specs

Task	Automatic Speech Recognition (ASR)
Architecture	FastConformer-Transducer (RNNT)
Parameters	0.6B
License	CC-BY-4.0

about this model

Parakeet-RNNT-0.6B is an automatic speech recognition (ASR) model that transcribes English speech into lower-case text. Developed jointly by NVIDIA NeMo and Suno.ai, it is a FastConformer Transducer model with approximately 600 million parameters. The FastConformer architecture is an optimized version of the Conformer model that uses 8x depthwise-separable convolutional downsampling, achieving 2.8x faster inference than the original Conformer while supporting scaling to billion-parameter models.
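To make the 8x downsampling concrete, the arithmetic can be sketched as follows. This assumes the acoustic front end emits one feature frame every 10 ms, a common setting that this page does not state; the function name and its defaults are illustrative only.

```python
import math


def encoder_frames(duration_s: float, hop_ms: float = 10.0, downsample: int = 8) -> int:
    """Approximate number of encoder frames for a clip of the given length.

    hop_ms is an ASSUMED feature hop (10 ms is common but not confirmed here);
    downsample is the 8x factor described in the text above.
    """
    feature_frames = math.ceil(duration_s * 1000.0 / hop_ms)
    return math.ceil(feature_frames / downsample)


# Under the 10 ms assumption, a 60-second clip yields 6000 feature frames,
# which the 8x downsampling reduces to 750 encoder frames (about 80 ms each).
print(encoder_frames(60.0))
```

Fewer encoder frames mean fewer attention and decoding steps per second of audio, which is the intuition behind the speedup quoted above.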

Key Capabilities

The model accepts 16 kHz mono-channel audio as input and outputs transcribed text as a string. It uses a SentencePiece Unigram tokenizer with a vocabulary size of 1024. The model was trained on 64,000 hours of English speech, comprising a private 40,000-hour subset and 24,000 hours from public datasets including LibriSpeech, Fisher Corpus, Switchboard-1, WSJ, VCTK, VoxPopuli, Europarl-ASR, MLS English, Mozilla Common Voice, and People's Speech.
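Because the model expects 16 kHz mono input, it can help to check a file before submitting it. The following is a minimal sketch using only Python's standard `wave` module; the helper name `check_wav` is hypothetical and not part of any Parakeet or NeMo API.

```python
import wave


def check_wav(path: str, expected_rate: int = 16000) -> list[str]:
    """Return a list of format problems; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        if rate != expected_rate:
            problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
        if channels != 1:
            problems.append(f"{channels} channels found, expected mono")
    return problems
```

A file that fails either check would need to be resampled or downmixed with an audio tool before transcription.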

Benchmark Performance

Word Error Rate (WER) with greedy decoding on standard benchmarks:

Dataset	WER (%)
LibriSpeech test-clean	1.63
SPGISpeech	3.06
TED-LIUM v3	3.47
VoxPopuli	3.86
Common Voice	8.07
GigaSpeech	10.07
Earnings-22	14.78
AMI	17.55

These results use greedy decoding without an external language model. The model can also transcribe long-form audio of up to 11 hours by using limited-context attention, which was introduced after initial training through fine-tuning with a global token.
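For readers less familiar with the metric in the table above, WER is conventionally the number of word substitutions, deletions, and insertions divided by the number of reference words. The sketch below computes it with a standard word-level edit distance. It is an illustration of the metric only: the function name is hypothetical, and the text normalization used for the published benchmark figures is not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Multiplying the result by 100 gives a percentage comparable in form to the table entries.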

Licensing

This model is released under the CC-BY-4.0 license.

best for

· Transcribing English meeting recordings
· Voice command transcription
· Automated captioning of English audio

FAQ

What is the primary use of Parakeet RNNT 0.6B?

It transcribes English speech into lower-case text for general ASR tasks.

What audio format does the model accept?

It accepts 16 kHz (16,000 Hz) mono-channel audio, such as WAV files.

How can I use this model via gigarouter?

The model is not yet live. Once it is, send audio to the gigarouter OpenAI-compatible endpoint with an API key and receive the transcription in the response.

What is the license for this model?

It is licensed under CC-BY-4.0.

What is the model's Word Error Rate on standard benchmarks?

It achieves a WER of 1.63% on LibriSpeech test-clean and 14.78% on Earnings-22 with greedy decoding.
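As a rough illustration of the endpoint question above, the sketch below builds (but does not send) a transcription request in the style of the OpenAI audio transcription API, using only the standard library. Several things here are assumptions: the base URL is a placeholder because this page does not give the real one, the `/audio/transcriptions` path and the `model` and `file` form fields follow the OpenAI convention rather than anything confirmed for gigarouter, and the model is listed as coming soon, so the model identifier may differ once it is hosted.

```python
import urllib.request
import uuid

# ASSUMPTION: placeholder base URL; the real gigarouter URL is not given on this page.
BASE_URL = "https://api.example.invalid/v1"


def build_transcription_request(
    audio_bytes: bytes,
    api_key: str,
    model: str = "nvidia/parakeet-rnnt-0.6b",  # ASSUMPTION: hosted model id may differ
    filename: str = "audio.wav",
) -> urllib.request.Request:
    """Build a multipart/form-data POST in the OpenAI transcription style."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        # Plain form field carrying the model name.
        (f"--{boundary}\r\n"
         'Content-Disposition: form-data; name="model"\r\n\r\n'
         f"{model}\r\n").encode(),
        # File field header, followed by the raw audio bytes.
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         "Content-Type: audio/wav\r\n\r\n").encode(),
        audio_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        f"{BASE_URL}/audio/transcriptions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
        method="POST",
    )
```

With a real base URL and key, the request could be sent with `urllib.request.urlopen`; an OpenAI-compatible client library pointed at the same base URL would be an equally reasonable choice.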

not yet live

We're benchmarking and onboarding Parakeet RNNT 0.6B as a hosted, OpenAI-compatible API.
