ARK-ASR 0.6B

AutoArk-AI/ARK-ASR-0.6B

published May 2026 · updated Jun 2026

ARK-ASR 0.6B is an automatic speech recognition model that uses a compact 0.6B-scale decoder LLM with a Whisper-style audio encoder and on-policy distillation to transcribe speech in 19 languages.

status: coming soon
API providers: 0
downloads per month: 1.6K
license: apache-2.0

specs

Task: Automatic Speech Recognition (ASR)
Architecture: Audio-capable autoregressive Transformer with Whisper-style encoder, MLP adapter, and Qwen2 decoder
Parameters: 0.6B decoder LLM + 0.6B-scale audio encoder
License: Apache 2.0

about this model

ARK-ASR-0.6B is an automatic speech recognition model trained with teacher-data adaptation and on-policy distillation (TD + OPD). It pairs a 0.6B-scale decoder LLM with a separate 0.6B-scale Whisper-style audio encoder and an MLP adapter to deliver compact, multilingual ASR. The model supports 19 languages: Chinese, English, German, Japanese, French, Korean, Spanish, Polish, Italian, Romanian, Hungarian, Czech, Dutch, Finnish, Croatian, Slovak, Slovene, Estonian, and Lithuanian.

Architecture

Audio is encoded by a Whisper-style encoder with RoPE and projected through an MLP adapter. The resulting audio embeddings are injected into a Qwen2 decoder by replacing the embeddings of audio placeholder tokens before transcript generation. The model operates on audio sampled at 16 kHz and is an autoregressive Transformer that ships with custom arkasr remote code.

[Figure: ARK-ASR architecture diagram showing the audio encoder, MLP adapter, and Qwen2 decoder pipeline]
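
The placeholder-replacement step described above can be illustrated with a minimal sketch. It assumes embeddings are plain Python lists, and the placeholder id (AUDIO_PLACEHOLDER_ID) and function name are hypothetical; the actual arkasr remote code operates on tensors and may differ in detail.

```python
# Illustrative only: a hypothetical placeholder id, not the model's real token id.
AUDIO_PLACEHOLDER_ID = -1


def inject_audio_embeddings(token_ids, text_embeddings, audio_embeddings,
                            placeholder_id=AUDIO_PLACEHOLDER_ID):
    """Replace the embedding at each audio placeholder position with the next
    adapter output, leaving ordinary text-token embeddings untouched."""
    n_slots = sum(1 for t in token_ids if t == placeholder_id)
    if n_slots != len(audio_embeddings):
        raise ValueError(
            f"{n_slots} placeholder tokens but {len(audio_embeddings)} audio embeddings"
        )
    audio_iter = iter(audio_embeddings)
    return [
        next(audio_iter) if tok == placeholder_id else emb
        for tok, emb in zip(token_ids, text_embeddings)
    ]


# Prompt layout: [text, <audio>, <audio>, text]
token_ids = [5, AUDIO_PLACEHOLDER_ID, AUDIO_PLACEHOLDER_ID, 7]
text_embeddings = [[0.5], [0.0], [0.0], [0.7]]   # decoder embedding lookup
audio_embeddings = [[1.0], [2.0]]                # encoder + MLP adapter output
decoder_inputs = inject_audio_embeddings(token_ids, text_embeddings, audio_embeddings)
```

The decoder then generates the transcript from decoder_inputs as it would from any other embedded prompt.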

Performance

Evaluated on 7 English benchmarks (word error rate, WER) and 3 Chinese benchmarks (character error rate, CER); lower is better:

English WER      AMI      Earnings22  GigaSpeech  LS Clean  LS Other  SPGISpeech  VoxPopuli  Avg
Ark-ASR          11.54%   10.07%      8.95%       1.87%     3.89%     2.89%       6.63%      6.55%
Qwen3-ASR-0.6B   11.66%   11.06%      9.14%       2.13%     4.45%     3.03%       7.07%      6.93%
Qwen3-ASR-1.7B   10.56%   10.25%      8.74%       1.63%     3.40%     2.84%       6.35%      6.25%

Chinese CER      AISHELL-1  Wenet-meeting  Wenet-net  Avg
Ark-ASR          2.02%      5.92%          4.96%      4.30%
Qwen3-ASR-0.6B   2.07%      5.57%          5.45%      4.36%
Qwen3-ASR-1.7B   1.50%      4.69%          4.55%      3.58%
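
The Avg column is consistent with an unweighted mean of the per-benchmark scores. The sketch below recomputes it from the figures in the tables; the dictionary and function names are illustrative, and the averaging method is inferred from the numbers rather than stated by the authors.

```python
from statistics import mean

# Per-benchmark scores (%) copied from the tables above.
english_wer = {
    "Ark-ASR":        [11.54, 10.07, 8.95, 1.87, 3.89, 2.89, 6.63],
    "Qwen3-ASR-0.6B": [11.66, 11.06, 9.14, 2.13, 4.45, 3.03, 7.07],
    "Qwen3-ASR-1.7B": [10.56, 10.25, 8.74, 1.63, 3.40, 2.84, 6.35],
}
chinese_cer = {
    "Ark-ASR":        [2.02, 5.92, 4.96],
    "Qwen3-ASR-0.6B": [2.07, 5.57, 5.45],
    "Qwen3-ASR-1.7B": [1.50, 4.69, 4.55],
}


def macro_average(scores):
    """Unweighted mean over benchmarks, rounded to two decimals like the table."""
    return round(mean(scores), 2)


english_avg = {name: macro_average(s) for name, s in english_wer.items()}
chinese_avg = {name: macro_average(s) for name, s in chinese_cer.items()}

# Relative average-WER reduction of Ark-ASR over the same-scale baseline.
relative_gain = (
    (english_avg["Qwen3-ASR-0.6B"] - english_avg["Ark-ASR"])
    / english_avg["Qwen3-ASR-0.6B"]
)
```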

Ark-ASR achieves an average English WER of 6.55% and an average Chinese CER of 4.30%, ahead of the same-scale Qwen3-ASR-0.6B baseline (6.93% WER, 4.36% CER). It beats that baseline on all 7 English benchmarks and on 2 of the 3 Chinese benchmarks; Wenet-meeting is the exception (5.92% vs 5.57%). The model was trained on only 100k hours of speech, compared to the 20M hours used for the Qwen3-Omni AuT encoder. The larger Qwen3-ASR-1.7B remains stronger overall, but the OPD training recipe narrows the average English WER gap to it from 0.68 points (Qwen3-ASR-0.6B) to 0.30 points under a much smaller audio budget.
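
For readers unfamiliar with the metrics, the sketch below shows the usual edit-distance definition of WER and CER. It is an assumption that the reported scores follow this definition; the page does not specify the text normalization applied before scoring, and the function names here are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions,
    insertions, and deletions each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference
    length, with spaces ignored (a common choice for Chinese)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, list(hypothesis.replace(" ", ""))) / len(ref_chars)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
example_wer = wer("the cat sat on the mat", "the cat sit on mat")
# One substituted character out of 6.
example_cer = cer("今天天气很好", "今天天汽很好")
```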

Methodology

The OPD training recipe samples 4 student rollouts per prompt and optimizes a KL objective over the union of the teacher's top-k and the student's top-k tokens. The foundational OPD paper has been accepted to the ICML 2026 FoGen Workshop. The model is released under the Apache-2.0 license, and the accompanying paper is published under CC BY 4.0.
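
One way to read the union top-k KL objective is sketched below for a single token position. Several details are assumptions that the page does not confirm: the KL direction (here student relative to teacher), renormalization over the union support, and the epsilon floor. The function names are illustrative.

```python
import math


def top_k_indices(probs, k):
    """Indices of the k largest probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])


def union_topk_kl(teacher_probs, student_probs, k, eps=1e-12):
    """KL(student || teacher) restricted to the union of both top-k sets.

    Assumption: both distributions are renormalized over the union support,
    and teacher probabilities are floored at eps to avoid division by zero.
    """
    support = sorted(top_k_indices(teacher_probs, k) | top_k_indices(student_probs, k))
    t_mass = sum(teacher_probs[i] for i in support)
    s_mass = sum(student_probs[i] for i in support)
    kl = 0.0
    for i in support:
        s = student_probs[i] / s_mass
        t = max(teacher_probs[i] / t_mass, eps)
        if s > 0:
            kl += s * math.log(s / t)
    return kl, support


# Toy 4-token vocabulary where teacher and student disagree on the top tokens.
teacher = [0.7, 0.2, 0.05, 0.05]
student = [0.1, 0.1, 0.6, 0.2]
loss, support = union_topk_kl(teacher, student, k=2)
```

Taking the union means tokens the student favors but the teacher does not still contribute to the loss, which a teacher-only top-k would miss.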

FAQ

What languages does ARK-ASR 0.6B support?

It supports 19 languages: Chinese, English, German, Japanese, French, Korean, Spanish, Polish, Italian, Romanian, Hungarian, Czech, Dutch, Finnish, Croatian, Slovak, Slovene, Estonian, and Lithuanian.

How does ARK-ASR 0.6B compare in size to other ASR models?

It pairs a 0.6B-parameter decoder LLM with a 0.6B-scale audio encoder. This compact model outperforms the same-scale Qwen3-ASR-0.6B on most reported benchmarks while remaining smaller than the 1.7B Qwen3-ASR variant, which is still stronger overall.

What is the license for ARK-ASR 0.6B?

The model weights are released under the Apache 2.0 license.

What audio format does the model expect?

The model expects 16 kHz mono audio. It can be provided as a file path or raw audio array, and the processor handles resampling and padding.
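
To make "16 kHz mono" concrete, here is a minimal sketch of downmixing and resampling in plain Python. It is illustrative only: the processor already handles resampling, the helper names are hypothetical, and linear interpolation without an anti-aliasing filter is a simplification that a production resampler would not use.

```python
TARGET_SR = 16_000  # sampling rate the model expects, per the answer above


def to_mono(frames):
    """Average the channels of each frame, e.g. [(left, right), ...]."""
    return [sum(frame) / len(frame) for frame in frames]


def resample_linear(samples, src_sr, dst_sr=TARGET_SR):
    """Resample a mono signal by linear interpolation between neighbours."""
    if src_sr == dst_sr or not samples:
        return list(samples)
    n_out = int(round(len(samples) * dst_sr / src_sr))
    last = len(samples) - 1
    out = []
    for j in range(n_out):
        pos = j * src_sr / dst_sr          # position in the source signal
        i = min(int(pos), last)
        frac = pos - i
        nxt = samples[min(i + 1, last)]
        out.append(samples[i] * (1 - frac) + nxt * frac)
    return out


# 48 stereo frames at 48 kHz (1 ms of audio) become 16 mono samples at 16 kHz.
stereo_48k = [(1.0, -1.0)] * 48
mono_16k = resample_linear(to_mono(stereo_48k), src_sr=48_000)
```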

How can I call ARK-ASR 0.6B via the gigarouter API?

Hosted access is not yet live. Once it is, use the gigarouter OpenAI-compatible endpoint with your API key and send the audio file or URL in a request formatted for the ASR task.
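
As a rough sketch of what such a request might look like, the code below builds (but does not send) a multipart upload using only the standard library. Everything specific is an assumption: the base URL is a placeholder, the /audio/transcriptions path and the "model" and "file" form fields mirror OpenAI's transcription convention, and the hosted model id may differ from the repository name.

```python
import uuid
import urllib.request

BASE_URL = "https://api.gigarouter.example/v1"  # placeholder, not a real endpoint
MODEL_ID = "AutoArk-AI/ARK-ASR-0.6B"            # assumed to match the repository id


def build_transcription_request(audio_bytes, filename, api_key,
                                base_url=BASE_URL, model=MODEL_ID):
    """Build a multipart/form-data POST in the OpenAI transcription style."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        # Form field carrying the model id.
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n".encode(),
        # Form field carrying the audio file.
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n".encode(),
        audio_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        f"{base_url}/audio/transcriptions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


request = build_transcription_request(b"RIFFfake", "clip.wav", api_key="sk-test")
# Once the endpoint exists, urllib.request.urlopen(request) would send it.
```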

not yet live

We're benchmarking and onboarding ARK-ASR 0.6B as a hosted, OpenAI-compatible API.
