
Moonshine Streaming Small

UsefulSensors/moonshine-streaming-small

published Jan 2026 · updated Feb 2026

Moonshine Streaming Small is a 123M parameter streaming automatic speech recognition (ASR) model that pairs a lightweight audio frontend with a sliding-window Transformer encoder for low-latency transcription on edge hardware.

status: coming soon
API providers: 0
downloads / mo: 6.1K
license: MIT

specs

Task: Automatic Speech Recognition (ASR)
Architecture: Streaming sliding-window Transformer encoder with autoregressive Transformer decoder
Parameters: 123M
License: Not specified in model card

about this model

Moonshine Streaming Small is an automatic speech recognition (ASR) model listed on gigarouter and designed for low-latency streaming transcription on edge-class hardware. It pairs a lightweight 50 Hz audio frontend with a sliding-window Transformer encoder. The encoder uses bounded local attention and no positional embeddings (an "ergodic" encoder), so each frame attends only to a fixed-size neighborhood and audio can be processed as it arrives. The model has 123 million parameters across 10 encoder and 10 decoder layers, with an encoder dimension of 620 and a decoder dimension of 512.
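To make the bounded local attention concrete, here is a minimal sketch of a sliding-window attention mask over 50 Hz frames. The window sizes (`left`, `right`) are illustrative assumptions; this page does not state the values the model actually uses, and the function names are hypothetical.

```python
FRAME_RATE_HZ = 50  # frontend frame rate stated in the description above


def num_frames(duration_s):
    """Number of encoder frames produced for a clip of the given length."""
    return int(duration_s * FRAME_RATE_HZ)


def sliding_window_mask(n, left, right):
    """mask[i][j] is True when frame i may attend to frame j.

    `left` and `right` are hypothetical window sizes, not the model's.
    The rule depends only on the offset j - i, never on absolute position,
    which is compatible with an encoder that has no positional embeddings.
    """
    return [[(i - left) <= j <= (i + right) for j in range(n)]
            for i in range(n)]


mask = sliding_window_mask(n=5, left=2, right=0)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because each row has at most `left + right + 1` allowed positions, attention cost grows linearly with the number of frames rather than quadratically, which is what makes the design attractive for streaming.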

Trained on approximately 300K hours of English speech data, Moonshine Streaming Small achieves competitive word error rates (WER) on standard benchmarks:

Dataset               WER (%)
AMI                   12.54
Earnings-22           13.53
GigaSpeech            10.41
LibriSpeech (clean)   2.49
LibriSpeech (other)   6.78
SPGISpeech            3.19
TED-LIUM              3.77
VoxPopuli             9.98
Average               7.84
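The reported average is the unweighted mean of the eight per-dataset scores. A short check, using only the numbers from the table:

```python
# Per-dataset WER (%) copied from the benchmark table.
wer = {
    "AMI": 12.54,
    "Earnings-22": 13.53,
    "GigaSpeech": 10.41,
    "LibriSpeech (clean)": 2.49,
    "LibriSpeech (other)": 6.78,
    "SPGISpeech": 3.19,
    "TED-LIUM": 3.77,
    "VoxPopuli": 9.98,
}

# Unweighted (macro) average: every dataset counts equally,
# regardless of how many hours of audio it contains.
average = sum(wer.values()) / len(wer)
print(f"{average:.2f}")  # 7.84
```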

Key strengths include low initial latency, thanks to the streaming encoder design, and suitability for devices with 0.1–1 TOPS of compute and sub-1 GB memory budgets. Known limitations: the autoregressive decoder adds latency proportional to transcript length; the current Hugging Face Transformers integration does not yet implement fully efficient streaming (it falls back to flash-attention for the sliding-window attention); and, like other seq2seq models, it may hallucinate or repeat on short or noisy audio. Intended use cases include live captioning, voice commands, and real-time transcription on constrained hardware.
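The decoder limitation can be pictured with a toy latency model: a fixed encoder cost plus one decoder step per output token. The timing values below are placeholders chosen for illustration, not measurements of this model, and `decode_latency_ms` is a hypothetical helper.

```python
def decode_latency_ms(num_tokens, per_token_ms, encoder_ms=0.0):
    """Toy model: total latency = encoder cost + one decoder step per token."""
    return encoder_ms + num_tokens * per_token_ms


# Placeholder timings (NOT benchmarks): 30 ms encoder, 5 ms per decoder step.
for tokens in (10, 20, 40):
    print(tokens, decode_latency_ms(tokens, per_token_ms=5.0, encoder_ms=30.0))
```

Doubling the transcript length doubles the decoder term, so long utterances dominate end-to-end latency even when the encoder streams efficiently.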

FAQ

What is the primary use case for Moonshine Streaming Small?

It is designed for low-latency, on-device English speech transcription on platforms with roughly 0.1–1 TOPS of compute and sub-1 GB memory budgets.

How does the Small model compare in size and speed to the Tiny and Medium variants?

The Small model has 123M parameters, placing it between Tiny (34M) and Medium (245M). It balances accuracy against efficiency, with an average WER of 7.84% across open ASR benchmarks.
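As a rough guide to how these sizes relate to a sub-1 GB memory budget, the sketch below estimates weight storage only, from the parameter counts above. It ignores activations, decoder caches, and runtime overhead, and the choice of data types is an assumption rather than a statement about which formats are distributed.

```python
PARAMS = {"Tiny": 34e6, "Small": 123e6, "Medium": 245e6}  # from the answer above
BYTES_PER_PARAM = {"float32": 4, "float16": 2}  # assumed storage formats


def weight_megabytes(n_params, dtype):
    """Approximate weight size in MB (1 MB = 1e6 bytes), weights only."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e6


for name, n in PARAMS.items():
    sizes = ", ".join(f"{dtype}: {weight_megabytes(n, dtype):.0f} MB"
                      for dtype in BYTES_PER_PARAM)
    print(f"{name:<7}{sizes}")
```

Under these assumptions the Small model's weights take about 492 MB in float32 and about 246 MB in float16.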

What input format does the model expect?

The model expects audio at the sampling rate configured in its processor. Pass the waveform through the Hugging Face AutoProcessor, which converts it into input tensors with attention masks.
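A minimal sketch of that preprocessing step follows. It assumes the processor exposes `feature_extractor.sampling_rate` and accepts `sampling_rate` and `return_tensors` arguments, as Hugging Face audio processors commonly do; `prepare_inputs` and `load_processor` are hypothetical helper names, and the model ID is taken from this listing.

```python
def prepare_inputs(processor, waveform, sampling_rate):
    """Check the sampling rate, then let the processor build the tensors.

    Assumes `processor.feature_extractor.sampling_rate` exists; verify this
    against the processor you actually load.
    """
    expected = processor.feature_extractor.sampling_rate
    if sampling_rate != expected:
        raise ValueError(
            f"expected {expected} Hz audio, got {sampling_rate} Hz; resample first"
        )
    # return_tensors="pt" asks for PyTorch tensors.
    return processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")


def load_processor(model_id="UsefulSensors/moonshine-streaming-small"):
    """Load the processor from the Hugging Face Hub (requires `transformers`)."""
    from transformers import AutoProcessor  # imported lazily; downloads on first use
    return AutoProcessor.from_pretrained(model_id)
```

Typical use would be `inputs = prepare_inputs(load_processor(), waveform, sr)`, where `waveform` is a mono float array and `sr` is its sampling rate.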

How can I call this model via the gigarouter API?

Once the model is live, use the gigarouter OpenAI-compatible endpoint with your API key, passing the model ID and the audio input in the request.
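Since the model is not yet live, the following is only a sketch of what such a request might look like. It assumes the endpoint follows the OpenAI audio transcription convention (a multipart POST to `/audio/transcriptions` with `model` and `file` fields). The base URL is a made-up placeholder, and the model ID is assumed to match the listing ID shown on this page. The code builds the request without sending it.

```python
import urllib.request
import uuid

BASE_URL = "https://api.gigarouter.example/v1"  # hypothetical placeholder, not a real endpoint
MODEL_ID = "UsefulSensors/moonshine-streaming-small"  # assumed to match the listing ID


def build_transcription_request(audio_bytes, filename, api_key,
                                base_url=BASE_URL, model=MODEL_ID):
    """Build (but do not send) a multipart POST to /audio/transcriptions."""
    boundary = uuid.uuid4().hex
    crlf = b"\r\n"
    body = b"".join([
        f"--{boundary}".encode(), crlf,
        b'Content-Disposition: form-data; name="model"', crlf, crlf,
        model.encode(), crlf,
        f"--{boundary}".encode(), crlf,
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(), crlf,
        b"Content-Type: application/octet-stream", crlf, crlf,
        audio_bytes, crlf,
        f"--{boundary}--".encode(), crlf,
    ])
    return urllib.request.Request(
        f"{base_url}/audio/transcriptions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Sending it would be a matter of passing the returned object to `urllib.request.urlopen`, after replacing the placeholder base URL with the real one from the gigarouter documentation.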

What are the known limitations of this model?

There are three. The decoder is autoregressive, so latency grows with transcript length. The Transformers implementation does not yet perform fully efficient streaming. And the model can hallucinate or repeat phrases on short or noisy audio.
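One common mitigation for runaway repetition on short clips is to cap the number of generated tokens in proportion to audio duration. The sketch below shows the idea; the rate of 6.5 tokens per second is an assumed heuristic, not a value given on this page, and `max_new_tokens_for` is a hypothetical helper.

```python
import math


def max_new_tokens_for(duration_s, tokens_per_second=6.5, floor=1):
    """Upper bound on generated tokens for a clip of `duration_s` seconds.

    `tokens_per_second` is an assumed speaking-rate ceiling; tune it for
    your data. `floor` keeps very short clips from getting a zero budget.
    """
    return max(floor, math.ceil(duration_s * tokens_per_second))


print(max_new_tokens_for(2.0))   # 13
print(max_new_tokens_for(10.0))  # 65
```

With Hugging Face Transformers, the result could be passed as the `max_new_tokens` argument to `generate`, which bounds both the worst-case decoder latency and the length of any repeated output.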

not yet live

We're benchmarking and onboarding Moonshine Streaming Small as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
