Moonshine Streaming Medium
UsefulSensors/moonshine-streaming-medium
published Jan 2026 · updated Feb 2026
Moonshine Streaming Medium is a 245M parameter streaming automatic speech recognition model that uses a sliding-window Transformer encoder for low-latency English transcription on edge hardware.
specs
| Field | Value |
|---|---|
| Task | Automatic Speech Recognition (ASR) |
| Architecture | Sequence-to-sequence with sliding-window Transformer encoder and autoregressive decoder |
| Parameters | 245M |
| License | Unknown |
about this model
Moonshine Streaming Medium is an automatic speech recognition (ASR) model that uses a sliding-window Transformer encoder with bounded local attention and no positional embeddings to deliver low-latency streaming transcription on edge-class hardware. It pairs a lightweight 50 Hz audio frontend with an autoregressive decoder enhanced by a positional-information adapter. The medium variant contains 245M parameters (14 encoder / 14 decoder layers, 768 encoder dim, 640 decoder dim).
Architecture and Key Strengths
The encoder’s ergodic design and windowed attention (16,4) in lookahead layers enable an 80 ms lookahead, minimizing time-to-first-token. The model attains accuracy on par with models six times its size while running significantly faster, as reported in the accompanying paper. On the Open ASR Leaderboard, the medium model achieves a mean WER of 6.66% (rank 13) and a real-time factor (RTFX) of 448.15 (rank 34).
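The 80 ms figure is consistent with the 50 Hz frontend: each frame spans 20 ms, so a window that sees 4 future frames looks 80 ms ahead. The sketch below illustrates that arithmetic and the shape of a banded attention mask. It assumes that (16,4) means 16 past frames and 4 future frames for a single layer, which the text does not state explicitly; the function names are illustrative and not taken from the Moonshine code, and how lookahead accumulates across stacked layers is not modeled.

```python
FRAME_RATE_HZ = 50  # frontend frame rate given in the model description


def lookahead_ms(right_frames: int, frame_rate_hz: int = FRAME_RATE_HZ) -> float:
    """Duration of future audio covered by `right_frames` frames, in milliseconds."""
    return right_frames * 1000.0 / frame_rate_hz


def sliding_window_mask(num_frames: int, left: int, right: int) -> list[list[bool]]:
    """Return mask[i][j], True when query frame i may attend to key frame j.

    Assumption: the window covers `left` past frames, the current frame,
    and `right` future frames.
    """
    return [
        [(i - left) <= j <= (i + right) for j in range(num_frames)]
        for i in range(num_frames)
    ]


if __name__ == "__main__":
    # A small window (2 past, 1 future) keeps the printed band readable.
    for row in sliding_window_mask(num_frames=8, left=2, right=1):
        print("".join("x" if allowed else "." for allowed in row))
    print(lookahead_ms(4), "ms of lookahead for 4 future frames")
```

Because every frame attends to a fixed-size neighbourhood, the per-frame attention cost stays bounded regardless of how long the audio stream runs.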
Benchmark Performance
Word error rates (WER %) across standard English benchmarks:
| Dataset | Medium (245M) |
|---|---|
| AMI | 10.68 |
| Earnings-22 | 11.90 |
| GigaSpeech | 9.46 |
| LibriSpeech (clean) | 2.08 |
| LibriSpeech (other) | 5.00 |
| SPGISpeech | 2.58 |
| TED-LIUM | 2.99 |
| VoxPopuli | 8.54 |
| Average | 6.65 |
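The Average row can be checked as an unweighted mean of the eight per-dataset scores. The short sketch below copies the values from the table and recomputes it; treating the average as unweighted is an assumption that the numbers happen to support.

```python
from statistics import mean

# WER (%) per dataset, copied from the table above.
wer_percent = {
    "AMI": 10.68,
    "Earnings-22": 11.90,
    "GigaSpeech": 9.46,
    "LibriSpeech (clean)": 2.08,
    "LibriSpeech (other)": 5.00,
    "SPGISpeech": 2.58,
    "TED-LIUM": 2.99,
    "VoxPopuli": 8.54,
}

# Unweighted (macro) average: every dataset counts equally,
# regardless of how many hours of audio it contains.
average_wer = mean(wer_percent.values())
print(f"Average WER: {average_wer:.2f}%")
```

The result rounds to 6.65, matching the table. A 0.01-point gap from the 6.66% leaderboard figure can arise when unrounded per-dataset scores are averaged, though the source does not say how the leaderboard value was computed.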
Training and Limitations
The model was trained on roughly 300K hours of English speech data drawn from public web data, open datasets, and internally prepared sources. It is intended for low-latency, on-device transcription; because the decoder is autoregressive, full-output latency grows with transcript length even when time-to-first-token is low. Like other seq2seq ASR models, it may hallucinate or repeat phrases on short or noisy segments.
best for
- Live captioning on edge devices
- Real-time voice command recognition
- On-device transcription with low memory and compute
FAQ
The average WER across Open ASR benchmarks is 6.65% (model card) or 6.66% (Open ASR Leaderboard).
The RTFX on the Open ASR Leaderboard is 448.15.
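RTFX is the inverse real-time factor: seconds of audio processed per second of compute. As a rough illustration, the sketch below converts the leaderboard value into processing time for a given amount of audio. The function name is illustrative, and the result reflects the leaderboard's measurement setup rather than what an edge device would achieve.

```python
def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Compute time implied by an RTFX score (audio duration / processing time)."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


if __name__ == "__main__":
    one_hour = 3600.0
    seconds = processing_seconds(one_hour, rtfx=448.15)
    print(f"{seconds:.2f} s of compute per hour of audio")
```

At an RTFX of 448.15, one hour of audio corresponds to roughly 8 seconds of processing under the leaderboard's conditions.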
The model is trained and evaluated only on English, though the Moonshine framework supports additional languages for STT.
Use the gigarouter OpenAI-compatible endpoint with your API key and the model ID moonshine-streaming-medium.
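A request to an OpenAI-compatible transcription endpoint might look like the sketch below, which uses only the Python standard library. The base URL is a hypothetical placeholder, since the page does not give the real one. The sketch also assumes the endpoint follows the usual OpenAI convention of a multipart POST to /audio/transcriptions with `model` and `file` fields and a JSON response containing `text`; none of this is confirmed for gigarouter.

```python
import json
import urllib.request
import uuid

BASE_URL = "https://api.example.invalid/v1"  # hypothetical placeholder, not the real endpoint
MODEL_ID = "moonshine-streaming-medium"      # model ID given in the text


def build_transcription_request(
    audio_bytes: bytes,
    api_key: str,
    filename: str = "audio.wav",
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data transcription request."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        # Form field carrying the model ID.
        f'--{boundary}\r\nContent-Disposition: form-data; name="model"\r\n\r\n'
        f"{MODEL_ID}\r\n".encode(),
        # Form field carrying the audio file.
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'.encode(),
        audio_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        url=f"{base_url}/audio/transcriptions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
        method="POST",
    )


def transcribe(audio_bytes: bytes, api_key: str) -> str:
    """Send the request and return the transcript (assumes a JSON body with a 'text' key)."""
    request = build_transcription_request(audio_bytes, api_key)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["text"]
```

Separating request construction from sending makes it possible to inspect the payload without network access; `transcribe` only works once a real base URL and API key are substituted.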
Known limitations: the decoder is autoregressive, so latency grows with transcript length; the model can hallucinate words on short or noisy audio; and the Transformers implementation does not yet perform fully efficient streaming.
Moonshine Streaming Medium is currently being benchmarked and onboarded as a hosted, OpenAI-compatible API.