Moonshine Streaming Medium

UsefulSensors/moonshine-streaming-medium

published Jan 2026 · updated Feb 2026

Moonshine Streaming Medium is a 245M-parameter streaming automatic speech recognition model that uses a sliding-window Transformer encoder for low-latency English transcription on edge hardware.

status

coming soon

API providers

downloads / mo

12.9K

license

mit

specs

Task	Automatic Speech Recognition (ASR)
Architecture	Sequence-to-sequence with sliding-window Transformer encoder and autoregressive decoder
Parameters	245M
License	MIT

about this model

Moonshine Streaming Medium is an automatic speech recognition (ASR) model that uses a sliding-window Transformer encoder with bounded local attention and no positional embeddings to deliver low-latency streaming transcription on edge-class hardware. It pairs a lightweight 50 Hz audio frontend with an autoregressive decoder enhanced by a positional-information adapter. The medium variant contains 245M parameters (14 encoder / 14 decoder layers, 768 encoder dim, 640 decoder dim).

Architecture and Key Strengths

The encoder’s ergodic design and windowed attention (16,4) in lookahead layers give an 80 ms lookahead, which keeps time-to-first-token low. According to the accompanying paper, the model attains accuracy on par with models six times its size while running significantly faster. On the Open ASR Leaderboard, the medium model achieves a mean WER of 6.66% (rank 13) and an inverse real-time factor (RTFX) of 448.15 (rank 34); for RTFX, higher means faster.
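The 80 ms figure is consistent with the 50 Hz frontend if the second number in (16,4) counts lookahead frames: each frame then spans 20 ms, and 4 frames give 80 ms. The sketch below assumes (16,4) means 16 frames of left context and 4 frames of right context. That reading and the helper names are illustrative, not taken from the model's implementation.

```python
FRAME_RATE_HZ = 50      # frontend frame rate stated in the model description
LEFT, RIGHT = 16, 4     # ASSUMED meaning of (16,4): left context, lookahead


def lookahead_ms(right_frames: int, frame_rate_hz: int = FRAME_RATE_HZ) -> float:
    """Lookahead in milliseconds for a given number of future frames."""
    return right_frames * 1000.0 / frame_rate_hz


def window_mask(n_frames: int, left: int = LEFT, right: int = RIGHT) -> list[list[bool]]:
    """mask[i][j] is True when query frame i may attend to key frame j."""
    return [
        [-left <= j - i <= right for j in range(n_frames)]
        for i in range(n_frames)
    ]


mask = window_mask(32)
print(lookahead_ms(RIGHT))  # 80.0
print(sum(mask[20]))        # 21 visible frames: 16 past + itself + 4 future
```

Because each frame sees only a bounded window, the amount of future audio the encoder must wait for stays fixed regardless of how long the stream runs.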

Benchmark Performance

Word error rates (WER %) across standard English benchmarks:

Dataset	Medium (245M)
AMI	10.68
Earnings-22	11.90
GigaSpeech	9.46
LibriSpeech (clean)	2.08
LibriSpeech (other)	5.00
SPGISpeech	2.58
TED-LIUM	2.99
VoxPopuli	8.54
Average	6.65
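
The reported average is consistent with an unweighted mean of the eight rows above, which can be checked directly:

```python
# WER values (%) copied from the table above.
wer_by_dataset = {
    "AMI": 10.68,
    "Earnings-22": 11.90,
    "GigaSpeech": 9.46,
    "LibriSpeech (clean)": 2.08,
    "LibriSpeech (other)": 5.00,
    "SPGISpeech": 2.58,
    "TED-LIUM": 2.99,
    "VoxPopuli": 8.54,
}

# Unweighted mean: every dataset counts equally, regardless of its size.
average = sum(wer_by_dataset.values()) / len(wer_by_dataset)
print(f"{average:.2f}")  # 6.65
```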

Training and Limitations

The model was trained on roughly 300K hours of English speech data drawn from public web data, open datasets, and internally prepared sources. It is intended for low-latency, on-device transcription. Because the decoder is autoregressive, full-output latency grows with transcript length even when time-to-first-token is low. Like other seq2seq ASR models, it may hallucinate or repeat phrases on short or noisy segments.
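
The latency behavior can be illustrated with a simple linear model. This is a minimal sketch that assumes a fixed time-to-first-token and a constant cost per additional token; the function name and both timing values are hypothetical placeholders, not measurements of this model.

```python
def full_output_latency_ms(n_tokens: int, ttft_ms: float, per_token_ms: float) -> float:
    """Time until the last token is emitted under a linear decoding model."""
    if n_tokens <= 0:
        return 0.0
    # The first token costs ttft_ms; each later token adds one decoder step.
    return ttft_ms + (n_tokens - 1) * per_token_ms


# Hypothetical numbers, chosen only to show the trend.
for n in (1, 11, 101):
    print(n, full_output_latency_ms(n, ttft_ms=100.0, per_token_ms=10.0))
```

Under these assumptions the first token always arrives after the same delay, while the time to the final token scales with the number of tokens decoded.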

best for

· Live captioning on edge devices
· Real-time voice command recognition
· On-device transcription with low memory and compute

FAQ

What is the average Word Error Rate (WER) for the Medium model?

The average WER across Open ASR benchmarks is 6.65% (model card) or 6.66% (Open ASR Leaderboard).
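
For reference, WER is the word-level edit distance between the hypothesis and the reference transcript (substitutions, deletions, and insertions) divided by the number of reference words. The function below is a minimal sketch of that definition. Benchmarks typically normalize text before scoring; this sketch only splits on whitespace and is not the leaderboard's scoring code.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate as a fraction: edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the ref prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
print(f"{100 * wer('the cat sat on the mat', 'the cat sit on mat'):.2f}")  # 33.33
```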

What is the RTFX (inverse real-time factor) for the Medium model?

The RTFX on the Open ASR Leaderboard is 448.15. RTFX is seconds of audio processed per second of compute, so higher is faster.
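
As a rough worked example, assuming RTFX is audio duration divided by processing time, the leaderboard figure implies that one hour of audio takes about eight seconds to transcribe on the leaderboard's evaluation hardware. Edge devices will differ, and the helper name below is illustrative.

```python
def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Processing time implied by an inverse real-time factor."""
    return audio_seconds / rtfx


print(f"{processing_seconds(3600, 448.15):.2f}")  # 8.03
```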

What languages does this model support?

The model is trained and evaluated only on English, though the Moonshine framework supports additional languages for STT.

How do I call this model via the gigarouter API?

The hosted API for this model is currently listed as coming soon. Once it is live, use the gigarouter OpenAI-compatible endpoint with your API key and the model ID moonshine-streaming-medium.
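
A minimal sketch of what such a request might look like, using only the Python standard library. It assumes the endpoint follows the OpenAI audio transcription convention (a multipart POST to `/audio/transcriptions` with `model` and `file` fields). The base URL is a placeholder, not a real gigarouter address, and the helper name is illustrative. The request is built but not sent.

```python
import urllib.request
import uuid

BASE_URL = "https://api.example.invalid/v1"  # PLACEHOLDER, not the real endpoint
MODEL_ID = "moonshine-streaming-medium"      # model ID given in this FAQ


def build_transcription_request(
    audio_bytes: bytes, filename: str, api_key: str
) -> urllib.request.Request:
    """Build (but do not send) a multipart transcription request."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        # Form field: model
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{MODEL_ID}\r\n".encode(),
        # Form field: file (header, then raw audio bytes)
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n".encode(),
        audio_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        f"{BASE_URL}/audio/transcriptions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


req = build_transcription_request(b"\x00\x00", "clip.wav", "YOUR_API_KEY")
print(req.get_method(), req.full_url)
# Passing req to urllib.request.urlopen would perform the actual call.
```

An OpenAI-compatible client library pointed at the provider's base URL would do the same job with less code; the standard-library version is shown only to make the request shape explicit.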

What are the known limitations of this model?

The decoder is autoregressive, so latency grows with transcript length; it can hallucinate words on short or noisy audio; the Transformers implementation does not yet perform fully efficient streaming.

