Moonshine Base

UsefulSensors/moonshine-base

published Nov 2024 · updated Jan 2025

Moonshine Base is an automatic speech recognition (ASR) model that transcribes English speech to text, optimized for live transcription and voice commands.

status	coming soon

downloads / mo	40.6K

license	MIT

specs

Task	Automatic Speech Recognition (ASR) – English speech-to-text
Architecture	Encoder-decoder transformer with Rotary Position Embedding (RoPE)
Parameters	61 million
Languages	English only

about this model

Moonshine is a family of automatic speech recognition (ASR) models that transcribe English speech into text. It targets live transcription and voice command processing on resource-constrained hardware.

Architecture and Key Strengths

Moonshine uses an encoder-decoder transformer architecture with Rotary Position Embedding (RoPE) instead of traditional absolute position embeddings. It is trained on speech segments of varying lengths without zero-padding, improving encoder inference efficiency. The model is available in two sizes: Tiny (27M parameters) and Base (61M parameters), both English-only.
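To make the RoPE idea concrete, here is a minimal sketch in plain Python. It rotates consecutive pairs of a query or key vector by an angle proportional to the token position, so the dot product of a rotated query and a rotated key depends only on their relative offset. The function names, the base of 10000, and the adjacent-pair layout are assumptions chosen for illustration; the source does not state Moonshine's exact RoPE configuration.

```python
import math


def rotary_embed(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `position`.

    Illustrative only: `base` and the pairing layout are assumed values,
    not Moonshine's published settings.
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector dimension must be even")
    out = []
    for i in range(0, dim, 2):
        # Each pair gets its own frequency; later pairs rotate more slowly.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + 1]
        out.extend([a * c - b * s, a * s + b * c])
    return out


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))
```

Because each pair is only rotated, vector norms are unchanged, and a query at position 5 scored against a key at position 3 gives the same value as position 2 against position 0. No learned table of absolute positions is needed.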

Performance

When benchmarked against OpenAI's Whisper tiny-en, Moonshine Tiny demonstrates a 5x reduction in compute requirements for transcribing a 10-second speech segment while incurring no increase in word error rates across standard evaluation datasets. The models are trained on 200,000 hours of audio and corresponding transcripts collected from the internet and openly available datasets.
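For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is one common way to compute it; the function name is hypothetical, and the lowercase-and-split normalization is a simplifying assumption rather than the procedure used in the published benchmark.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring "turn off the light" against the reference "turn on the lights" counts two substitutions over four reference words, a WER of 0.5.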

Limitations

Like other sequence-to-sequence ASR models, Moonshine may produce hallucinations (text not present in the audio) or repetitive text, particularly with short audio segments or segments where words are cut off at the beginning or end. The model is intended for English speech transcription only and has not been robustly evaluated for classification, speaker identification, or other non-transcription tasks.
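One lightweight guard against the repetitive output described above is to flag transcripts in which the same short phrase repeats back to back. The sketch below is a heuristic under stated assumptions: the function names, the three-word phrase length, and the threshold of two repeats are illustrative choices and are not part of Moonshine or its tooling.

```python
def longest_repeat_run(text, n=3):
    """Return the largest number of back-to-back repeats of any n-word phrase."""
    words = text.lower().split()
    best = 1 if len(words) >= n else 0
    for start in range(len(words) - n + 1):
        phrase = words[start:start + n]
        run = 1
        pos = start + n
        # Count how many times the same phrase follows itself immediately.
        while words[pos:pos + n] == phrase:
            run += 1
            pos += n
        best = max(best, run)
    return best


def looks_repetitive(text, n=3, max_run=2):
    """Heuristic flag: True when an n-word phrase repeats more than `max_run` times."""
    return longest_repeat_run(text, n) > max_run
```

A flagged transcript can be discarded or re-run on a longer audio window. The check does not catch hallucinated text that is fluent and non-repeating.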

best for

· Live transcription of English speech in real-time applications
· Voice command processing for resource-constrained devices
· Low-latency voice agents and conversational AI

FAQ

What is the Moonshine Base model?

Moonshine Base is a 61-million-parameter English ASR model from Useful Sensors, designed for fast, on-device speech transcription.

How does Moonshine Base compare to Whisper in speed and size?

The headline comparison is for the smaller model: Moonshine Tiny (27M) needs 5x less compute than Whisper tiny-en on a 10-second segment with no increase in word error rate. Moonshine Base (61M) is the larger of the two sizes and offers higher accuracy while remaining efficient.

Is Moonshine Base multilingual?

No. Moonshine Base is English-only; the model card lists no multilingual version of the base size.

What are the input and output formats?

Input is a raw audio waveform sampled at the processor’s sampling rate; output is the transcribed English text.
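As an illustration of that input and output shape, the sketch below runs the model locally with the Hugging Face `transformers` library, which must be a release recent enough to include Moonshine support. The helper names and the cap of 6.5 generated tokens per second of audio are assumptions for illustration; the cap is one way to limit runaway generation on short clips. The waveform is assumed to be a 1-D float array already resampled to the processor's sampling rate.

```python
def max_generation_length(num_samples, sampling_rate, tokens_per_second=6.5):
    """Upper bound on generated tokens for a clip of `num_samples` samples.

    `tokens_per_second` is an assumed, illustrative limit.
    """
    return max(1, int(num_samples / sampling_rate * tokens_per_second))


def transcribe(waveform, model_id="UsefulSensors/moonshine-base"):
    """Transcribe a mono waveform to English text."""
    # Heavy imports are deferred so the helper above works without them.
    import torch
    from transformers import AutoProcessor, MoonshineForConditionalGeneration

    processor = AutoProcessor.from_pretrained(model_id)
    model = MoonshineForConditionalGeneration.from_pretrained(model_id)
    sampling_rate = processor.feature_extractor.sampling_rate

    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    max_length = max_generation_length(len(waveform), sampling_rate)
    with torch.no_grad():
        generated = model.generate(**inputs, max_length=max_length)
    return processor.decode(generated[0], skip_special_tokens=True)
```

Calling `transcribe(samples)` downloads the weights on first use and returns a plain string. Reading the sampling rate from the processor avoids hard-coding a value that the caller would otherwise have to keep in sync with the model.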

How can I use Moonshine Base via the gigarouter API?

The hosted API is not yet live. Once it is, send audio to the gigarouter OpenAI-compatible endpoint with your API key and receive the transcription in the response.
