MOSS Transcribe Preview 2B
OpenMOSS-Team/MOSS-Transcribe-preview-2B
published Jun 2026 · updated Jun 2026
MOSS Transcribe Preview 2B is an ASR model that transcribes English speech to text using a Qwen3-1.7B language model backbone with a Qwen3-Omni-MoE audio encoder and a gated-MLP adapter.
specs
| Spec | Value |
|---|---|
| Task | Automatic Speech Recognition (ASR) |
| Architecture | Qwen3-1.7B base + Qwen3-Omni-MoE audio encoder + gated-MLP adapter |
| Parameters | ~2.4B |
| License | Apache-2.0 |
| Language | English |
| Average WER | 4.87% on Open ASR Leaderboard |
about this model
MOSS-Transcribe-preview-2B is an automatic speech recognition model for English that pairs a Qwen3-1.7B-base language-model backbone with a Qwen3-Omni-MoE audio encoder. A gated-MLP adapter projects audio features into the embedding space. The model has approximately 2.4B parameters and is distributed as a single bfloat16 shard.
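To make the adapter's role concrete, here is a minimal sketch of a gated MLP that maps one audio-encoder frame into a language-model embedding. The source does not give the adapter's exact form, so the SiLU-gated (SwiGLU-style) layout, the absence of biases, the toy dimensions, and all function names are assumptions for illustration, not the released implementation. It uses only the standard library to stay dependency-free.

```python
import math
import random


def matvec(weights, vector):
    """Multiply a (rows x cols) matrix, stored as a list of rows, by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def silu(value):
    """SiLU activation: x * sigmoid(x)."""
    return value / (1.0 + math.exp(-value))


def gated_mlp_adapter(frame, w_gate, w_up, w_down):
    """Assumed gated-MLP form: down(silu(gate(x)) * up(x))."""
    gate = [silu(g) for g in matvec(w_gate, frame)]
    up = matvec(w_up, frame)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


# Toy sizes chosen for readability; they are NOT the model's real dimensions.
AUDIO_DIM, HIDDEN_DIM, LM_DIM = 8, 16, 12

rng = random.Random(0)
w_gate = random_matrix(HIDDEN_DIM, AUDIO_DIM, rng)
w_up = random_matrix(HIDDEN_DIM, AUDIO_DIM, rng)
w_down = random_matrix(LM_DIM, HIDDEN_DIM, rng)

audio_frame = [rng.gauss(0.0, 1.0) for _ in range(AUDIO_DIM)]
projected = gated_mlp_adapter(audio_frame, w_gate, w_up, w_down)
print(len(projected))  # one vector of size LM_DIM per audio frame
```

The projected vectors have the language model's embedding width, so in a design like this they can sit alongside ordinary token embeddings in the backbone's input sequence.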
Training and Evaluation
The model is trained on public English ASR corpora and fine-tuned with reinforcement learning on the Open ASR Leaderboard training splits. Evaluation is performed on the Open ASR Leaderboard test sets using greedy decoding (num_beams=1, max_new_tokens=512) with a single dataset-agnostic chat template. Results are scored with the leaderboard’s standardized English normalizer and word-level edit distance.
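The scoring step described above can be sketched as follows. This is a simplified illustration, not the leaderboard's code: `normalize` is a hypothetical stand-in for the standardized English normalizer (which does considerably more than lowercasing and stripping punctuation), and pooling errors over all utterances before dividing is an assumption about how per-dataset WER is aggregated.

```python
import re


def normalize(text: str) -> list[str]:
    """Simplified stand-in normalizer: lowercase, drop punctuation, split on whitespace."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def word_edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(pairs) -> float:
    """WER in percent over (reference, hypothesis) pairs: total edits / total reference words."""
    errors = 0
    words = 0
    for reference, hypothesis in pairs:
        ref_words = normalize(reference)
        hyp_words = normalize(hypothesis)
        errors += word_edit_distance(ref_words, hyp_words)
        words += len(ref_words)
    return 100.0 * errors / words


# One substitution in a six-word reference.
print(wer([("The cat sat on the mat.", "the cat sat on a mat")]))
```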
| Dataset | WER (%) |
|---|---|
| AMI | 8.37 |
| Earnings22 | 7.84 |
| GigaSpeech | 6.78 |
| LibriSpeech test.clean | 1.21 |
| LibriSpeech test.other | 2.84 |
| SPGISpeech | 1.63 |
| VoxPopuli | 5.39 |
| Average | 4.87 |
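As a quick arithmetic check, the reported average matches the unweighted mean of the seven per-dataset scores in the table. The variable names below are illustrative only.

```python
# Per-dataset WER values copied from the table above.
reported_wer = {
    "AMI": 8.37,
    "Earnings22": 7.84,
    "GigaSpeech": 6.78,
    "LibriSpeech test.clean": 1.21,
    "LibriSpeech test.other": 2.84,
    "SPGISpeech": 1.63,
    "VoxPopuli": 5.39,
}

# Unweighted (macro) mean: every dataset counts equally, regardless of its size.
average = sum(reported_wer.values()) / len(reported_wer)
print(f"{average:.2f}")  # 4.87
```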
The model uses a 16 kHz sampling rate with Whisper log-mel filterbank features (128 mel bins, FFT size 400, hop length 160). It is designed for English speech; performance may degrade on non-English speech, heavy accents, noisy recordings, overlapping speakers, far-field audio, or domain-specific terminology. Outputs should be manually reviewed before use in high-stakes applications.
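The frontend parameters translate into frame timings as sketched below. Two assumptions are made here: that the analysis window length equals the FFT size, and that one feature frame is emitted per hop with no extra trailing frame. The function name is hypothetical.

```python
SAMPLE_RATE_HZ = 16_000
N_FFT = 400        # samples per analysis window (assumed equal to the FFT size)
HOP_LENGTH = 160   # samples between consecutive frames
N_MELS = 128       # mel bins per frame

window_ms = 1000 * N_FFT / SAMPLE_RATE_HZ    # 25.0 ms per window
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE_HZ  # 10.0 ms between frames


def frontend_shape(duration_s: float) -> tuple[int, int]:
    """Approximate (mel_bins, frames) feature shape for a clip of the given length."""
    n_samples = round(duration_s * SAMPLE_RATE_HZ)
    n_frames = n_samples // HOP_LENGTH  # assumption: one frame per full hop
    return (N_MELS, n_frames)


print(window_ms, hop_ms)     # 25.0 10.0
print(frontend_shape(30.0))  # (128, 3000)
```

Under these assumptions the encoder sees roughly 100 feature frames per second of audio before any downsampling inside the encoder itself.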
best for
- Transcribing English meeting recordings
- Transcribing English podcasts
- Dictation and note-taking
FAQ
The model expects 16 kHz mono audio. The frontend uses a Whisper log-mel filterbank with 128 mel bins, an FFT size of 400, and a hop length of 160.
The model outputs a plain-text transcript of the English speech.
The model is released under the Apache-2.0 license.
The model has approximately 2.4B parameters, and its reported results use greedy decoding. Its average WER on the Open ASR Leaderboard is 4.87%, which is competitive with larger models.
Use the gigarouter OpenAI-compatible endpoint with your API key. The model accepts audio input and returns transcription text.
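A call through an OpenAI-compatible endpoint might look like the sketch below, which uses the `openai` Python SDK's audio transcription method. The source does not state the base URL, the environment variable names, or the model identifier used by the hosted endpoint, so `BASE_URL`, `GIGAROUTER_BASE_URL`, `GIGAROUTER_API_KEY`, and `MODEL_ID` are hypothetical placeholders to be replaced with the values from your account.

```python
import os

# Hypothetical placeholders: the real base URL and hosted model id are not given in the source.
BASE_URL = os.environ.get("GIGAROUTER_BASE_URL", "https://example.invalid/v1")
MODEL_ID = "OpenMOSS-Team/MOSS-Transcribe-preview-2B"  # assumed to match the repository id


def transcribe(audio_path: str) -> str:
    """Send one audio file to an OpenAI-compatible transcription endpoint and return the text."""
    from openai import OpenAI  # third-party SDK: pip install openai

    client = OpenAI(base_url=BASE_URL, api_key=os.environ["GIGAROUTER_API_KEY"])
    with open(audio_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(model=MODEL_ID, file=audio_file)
    return result.text


# Usage, once the endpoint is available and the variables above are set:
# print(transcribe("meeting.wav"))
```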
MOSS Transcribe Preview 2B is currently being benchmarked and onboarded as a hosted, OpenAI-compatible API.