MOSS Transcribe Preview 2B

OpenMOSS-Team/MOSS-Transcribe-preview-2B

published Jun 2026 · updated Jun 2026

MOSS Transcribe Preview 2B is an ASR model that transcribes English speech to text using a Qwen3-1.7B language model backbone with a Qwen3-Omni-MoE audio encoder and a gated-MLP adapter.

status

coming soon

API providers

downloads / mo

879

license

apache-2.0

specs

Task	Automatic Speech Recognition (ASR)
Architecture	Qwen3-1.7B base + Qwen3-Omni-MoE audio encoder + gated-MLP adapter
Parameters	~2.4B
License	Apache-2.0
Language	English
Average WER	4.87% on Open ASR Leaderboard

about this model

MOSS-Transcribe-preview-2B is an automatic speech recognition model for English that pairs a Qwen3-1.7B-base language-model backbone with a Qwen3-Omni-MoE audio encoder. A gated-MLP adapter projects the encoder's audio features into the language model's embedding space. The model has approximately 2.4B parameters and is distributed as a single bfloat16 shard.
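As a rough illustration of what a gated-MLP adapter does, the sketch below projects one audio feature vector into a smaller "embedding" vector using a SwiGLU-style gate. The page does not describe the adapter's internals, so the gating form, the SiLU activation, the toy dimensions, and every name here (`gated_mlp`, `w_gate`, `w_up`, `w_down`) are assumptions for illustration only.

```python
import math
import random


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(w: list[list[float]], x: list[float]) -> list[float]:
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def gated_mlp(x, w_gate, w_up, w_down):
    """Compute w_down @ (silu(w_gate @ x) * (w_up @ x)), an assumed gating form."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


def random_matrix(rows: int, cols: int, rng: random.Random) -> list[list[float]]:
    """Small random weights standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Toy sizes only (assumed); real encoder and embedding widths are far larger.
AUDIO_DIM, HIDDEN_DIM, EMBED_DIM = 8, 16, 4
rng = random.Random(0)
w_gate = random_matrix(HIDDEN_DIM, AUDIO_DIM, rng)
w_up = random_matrix(HIDDEN_DIM, AUDIO_DIM, rng)
w_down = random_matrix(EMBED_DIM, HIDDEN_DIM, rng)

audio_frame = [rng.uniform(-1.0, 1.0) for _ in range(AUDIO_DIM)]
embedding = gated_mlp(audio_frame, w_gate, w_up, w_down)
print(len(embedding))  # one output value per embedding dimension
```

In a design like this, the adapter's output width matches the language model's token-embedding width, which lets audio frames be fed to the backbone alongside text tokens.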

Training and Evaluation

The model is trained on public English ASR corpora and fine-tuned with reinforcement learning on the Open ASR Leaderboard training splits. Evaluation is performed on the Open ASR Leaderboard test sets using greedy decoding (num_beams=1, max_new_tokens=512) with a single dataset-agnostic chat template. Results are scored with the leaderboard’s standardized English normalizer and word-level edit distance.
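The scoring step can be sketched as follows: normalize both transcripts, compute a word-level edit distance, and divide total errors by total reference words. This is a minimal sketch, not the leaderboard's code. The `normalize` function is a toy stand-in for the leaderboard's English normalizer, and pooling errors across all utterances is an assumption about how the per-dataset figure is aggregated.

```python
import re


def normalize(text: str) -> list[str]:
    """Toy normalizer (assumed): lowercase, drop punctuation, split on whitespace."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def word_edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def corpus_wer(pairs: list[tuple[str, str]]) -> float:
    """WER in percent: total word errors divided by total reference words."""
    errors = 0
    ref_words = 0
    for reference, hypothesis in pairs:
        ref = normalize(reference)
        errors += word_edit_distance(ref, normalize(hypothesis))
        ref_words += len(ref)
    if ref_words == 0:
        raise ValueError("no reference words to score against")
    return 100.0 * errors / ref_words


# Made-up (reference, hypothesis) pairs for illustration.
pairs = [
    ("The cat sat on the mat.", "the cat sat on a mat"),
    ("Hello world", "hello world"),
]
print(corpus_wer(pairs))  # 1 substitution over 8 reference words -> 12.5
```

Because normalization runs before scoring, differences in casing and punctuation do not count as errors; only word substitutions, insertions, and deletions do.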

Dataset	WER (%)
AMI	8.37
Earnings22	7.84
GigaSpeech	6.78
LibriSpeech test.clean	1.21
LibriSpeech test.other	2.84
SPGISpeech	1.63
VoxPopuli	5.39
Average	4.87
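The Average row is consistent with an unweighted mean of the seven per-dataset figures, which the short check below reproduces. That the leaderboard average is computed this way, rather than weighted by dataset size, is inferred from the numbers and not stated on the page.

```python
from statistics import mean

# Per-dataset WER (%) copied from the table above.
wer_by_dataset = {
    "AMI": 8.37,
    "Earnings22": 7.84,
    "GigaSpeech": 6.78,
    "LibriSpeech test.clean": 1.21,
    "LibriSpeech test.other": 2.84,
    "SPGISpeech": 1.63,
    "VoxPopuli": 5.39,
}

# Unweighted mean across datasets, rounded to two decimals like the table.
average = round(mean(wer_by_dataset.values()), 2)
print(average)  # 4.87
```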

Figure: MOSS-Transcribe architecture, showing the Qwen3 language-model backbone and the audio encoder with its gated-MLP adapter.

The model expects audio sampled at 16 kHz and uses Whisper log-mel filterbank features (128 mel bins, FFT size 400, hop length 160, i.e. a 25 ms window and a 10 ms hop). It is designed for English speech; performance may degrade on non-English speech, heavy accents, noisy recordings, overlapping speakers, far-field audio, or domain-specific terminology. Review outputs manually before using them in high-stakes applications.
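To make the frontend parameters concrete, the sketch below converts them into window and hop durations and estimates the feature-matrix shape for a clip. The function names are illustrative. The frame count assumes one frame per hop with no padding, and any further downsampling inside the audio encoder is not described on the page and is ignored here.

```python
SAMPLE_RATE = 16_000  # Hz
N_FFT = 400           # samples per analysis window
HOP_LENGTH = 160      # samples between successive frames
N_MELS = 128          # mel bins per frame


def frontend_timing(sample_rate: int, n_fft: int, hop_length: int) -> dict[str, float]:
    """Convert sample counts into durations and a frame rate."""
    return {
        "window_ms": 1000.0 * n_fft / sample_rate,
        "hop_ms": 1000.0 * hop_length / sample_rate,
        "frames_per_second": sample_rate / hop_length,
    }


def feature_shape(
    duration_s: float,
    sample_rate: int = SAMPLE_RATE,
    hop_length: int = HOP_LENGTH,
    n_mels: int = N_MELS,
) -> tuple[int, int]:
    """Approximate (n_mels, n_frames): one frame per hop, padding ignored."""
    n_samples = int(round(duration_s * sample_rate))
    return n_mels, n_samples // hop_length


print(frontend_timing(SAMPLE_RATE, N_FFT, HOP_LENGTH))
print(feature_shape(10.0))  # shape estimate for a 10-second clip
```

With these settings each frame covers 25 ms of audio and frames are 10 ms apart, so a 10-second clip yields roughly 1000 frames of 128 mel values each.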

best for

·Transcribing English meeting recordings
·Transcribing English podcasts
·Dictation and note-taking

FAQ

What audio input format does the model expect?

The model expects 16 kHz mono audio. The frontend uses a Whisper log-mel filterbank with 128 mel bins, an FFT size of 400, and a hop length of 160.
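A quick way to confirm that a WAV file matches this format before transcription is to inspect its header. The sketch below uses only Python's standard `wave` module. The helper names `check_wav` and `make_silent_wav` are illustrative, and the check covers uncompressed PCM WAV files only.

```python
import io
import wave


def check_wav(source, expected_rate: int = 16_000, expected_channels: int = 1) -> list[str]:
    """Return a list of problems; an empty list means the header matches."""
    problems = []
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != expected_channels:
        problems.append(f"found {channels} channel(s), expected {expected_channels}")
    return problems


def make_silent_wav(rate: int, channels: int, seconds: float = 0.1) -> io.BytesIO:
    """Build an in-memory 16-bit PCM WAV of silence for demonstration."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * int(rate * seconds))
    buffer.seek(0)
    return buffer


print(check_wav(make_silent_wav(16_000, 1)))  # matching header: no problems
print(check_wav(make_silent_wav(44_100, 2)))  # wrong rate and channel count
```

`check_wav` also accepts a file path, since `wave.open` takes either a path or a file-like object. Files that fail the check need resampling or downmixing with an audio tool first.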

What is the output format?

The model returns a plain-text transcript of the English speech.

What is the license?

The model is released under the Apache-2.0 license.

How does this model compare in size and speed to other ASR models?

It has about 2.4B parameters, and the reported results use greedy decoding (num_beams=1). Its average WER on the Open ASR Leaderboard is 4.87%, which is competitive with larger models. This page reports no latency or throughput figures.

How can I call this model via API?

Once the model is live, use the gigarouter OpenAI-compatible endpoint with your API key. The model accepts audio input and returns the transcription as text.
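The model is not yet hosted, so the following is only a sketch of what an OpenAI-style transcription request might look like. It builds, but does not send, a multipart POST to the `/audio/transcriptions` path used by the OpenAI API. `BASE_URL` is a hypothetical placeholder, the page gives no base URL, and using the repository name as the API model ID is also an assumption.

```python
import urllib.request
import uuid

# Hypothetical placeholders: the page gives neither a base URL nor an API model ID.
BASE_URL = "https://api.example.invalid/v1"
MODEL_ID = "OpenMOSS-Team/MOSS-Transcribe-preview-2B"


def build_transcription_request(
    audio_bytes: bytes, filename: str, api_key: str
) -> urllib.request.Request:
    """Build (but do not send) a multipart POST for an OpenAI-style endpoint."""
    boundary = uuid.uuid4().hex
    model_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{MODEL_ID}\r\n"
    ).encode()
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    closing = f"\r\n--{boundary}--\r\n".encode()
    return urllib.request.Request(
        f"{BASE_URL}/audio/transcriptions",
        data=model_part + file_header + audio_bytes + closing,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
        method="POST",
    )


# Dummy bytes stand in for a real 16 kHz mono recording.
request = build_transcription_request(b"RIFF", "clip.wav", "YOUR_API_KEY")
print(request.get_method(), request.full_url)
```

Sending the request with `urllib.request.urlopen(request)` only makes sense once the model is live and the real base URL is known. The response format is likewise unconfirmed; OpenAI-style endpoints typically return JSON with a `text` field.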

not yet live

We're benchmarking and onboarding MOSS Transcribe Preview 2B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
