
mms-300m-1130-forced-aligner

MahmoudAshraf/mms-300m-1130-forced-aligner

A popular open forced-alignment model in the speech-to-text category, with 3.2M downloads a month. gigarouter is benchmarking it and plans to host it as an OpenAI-compatible API.

status

coming soon

API providers

downloads / mo

3.2M

license

cc-by-nc-4.0

about this model

MahmoudAshraf/mms-300m-1130-forced-aligner is a speech model for forced alignment between text and audio, listed under automatic speech recognition (ASR). It is the MMS-300M checkpoint that was trained on a forced alignment dataset, converted from torchaudio to Hugging Face Transformers format.

Key Strengths

Efficient forced alignment that uses significantly less memory than the TorchAudio forced alignment API.
Supports multilingual text preprocessing, with romanization selected by ISO-639-3 language code.
Generates emissions from audio in batches to improve throughput.
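
To illustrate the batching idea in the last point, here is a minimal sketch of one common way to split long audio into fixed-size windows with a little context on each side, then group the windows into batches. The function names, the window and context sizes, and the tiling strategy are assumptions for illustration, not the package's actual implementation.

```python
def make_windows(num_samples, window, context):
    """Split [0, num_samples) into windows with extra context on both sides.

    Returns (read_start, read_end, keep_start, keep_end) tuples. The model
    would read [read_start, read_end) but only the emissions for
    [keep_start, keep_end) are kept, so the kept parts cover the signal
    exactly once. All names here are hypothetical.
    """
    windows = []
    for keep_start in range(0, num_samples, window):
        keep_end = min(keep_start + window, num_samples)
        read_start = max(0, keep_start - context)
        read_end = min(num_samples, keep_end + context)
        windows.append((read_start, read_end, keep_start, keep_end))
    return windows


def batched(items, batch_size):
    """Group items into consecutive batches of at most batch_size."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Toy sizes: 100 samples, 30-sample windows, 5 samples of context.
windows = make_windows(num_samples=100, window=30, context=5)
batches = batched(windows, batch_size=2)
```

Larger batches generally trade memory for throughput, so the batch size is something to tune against the available hardware.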

Best For

Aligning transcribed text to audio at the word level for applications such as subtitle generation, pronunciation analysis, and audio segmentation.
Use cases that need accurate timestamps from speech data when memory and compute are limited.
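
As an example of the subtitle use case, the sketch below turns word-level timestamps into SubRip (SRT) cues. It assumes each word arrives as a dict with `text`, `start`, and `end` (in seconds); that shape, the function names, and the `max_words` and `max_gap` thresholds are illustrative assumptions rather than a documented output format.

```python
def srt_time(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=7, max_gap=0.8):
    """Group aligned words into cues and render them as SRT text.

    A new cue starts when the current one has max_words words or when the
    silence before the next word exceeds max_gap seconds.
    """
    cues, current = [], []
    for word in words:
        too_long = len(current) >= max_words
        long_pause = bool(current) and word["start"] - current[-1]["end"] > max_gap
        if current and (too_long or long_pause):
            cues.append(current)
            current = []
        current.append(word)
    if current:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        text = " ".join(w["text"] for w in cue)
        timing = f"{srt_time(cue[0]['start'])} --> {srt_time(cue[-1]['end'])}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    return "\n".join(blocks)


# Hypothetical aligner output for three words.
example_words = [
    {"text": "hello", "start": 0.0, "end": 0.4},
    {"text": "world", "start": 0.5, "end": 0.9},
    {"text": "again", "start": 2.5, "end": 3.0},
]
srt_text = words_to_srt(example_words)
```

The long pause before the third word starts a second cue, which is usually what viewers expect from subtitles.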

Performance

The model uses the MMS-300M architecture. The original checkpoint is reported to perform well on forced alignment, though no benchmark figures are listed here. The conversion to Hugging Face format lets the model be used from Transformers-based pipelines and is intended to preserve the alignment accuracy of the original checkpoint.

Workflow Overview

The model processes audio and text through the following pipeline: load the audio and the alignment model, generate emissions, preprocess the text (with optional romanization), compute alignments, extract spans, and produce word-level timestamps. Once the model is live, the gigarouter API is intended to handle these steps automatically.
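
The alignment and timestamp steps of this pipeline can be sketched in plain Python. The code below is a minimal CTC Viterbi forced aligner run on hand-made toy emissions, not the model's or the package's implementation. The blank index of 0, the 20 ms frame duration, the toy vocabulary, and all function names are assumptions for illustration.

```python
import math


def forced_align(log_probs, tokens, blank=0):
    """Viterbi CTC alignment: map each frame to an index in `tokens` or None.

    log_probs is a list of per-frame lists of log-probabilities over the
    vocabulary; tokens is the transcript as vocabulary ids.
    """
    # Interleave blanks: [blank, t0, blank, t1, ..., blank].
    states = [blank]
    for tok in tokens:
        states += [tok, blank]
    T, S = len(log_probs), len(states)
    score = [[-math.inf] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = log_probs[0][blank]
    if S > 1:
        score[0][1] = log_probs[0][states[1]]
    for t in range(1, T):
        for s in range(S):
            # Predecessors: stay, advance by one, or skip the blank between
            # two different tokens.
            preds = [s]
            if s >= 1:
                preds.append(s - 1)
            if s >= 2 and states[s] != blank and states[s] != states[s - 2]:
                preds.append(s - 2)
            best = max(preds, key=lambda p: score[t - 1][p])
            score[t][s] = score[t - 1][best] + log_probs[t][states[s]]
            back[t][s] = best
    # A valid path ends on the trailing blank or on the last token.
    s = S - 1
    if S > 1 and score[T - 1][S - 2] > score[T - 1][S - 1]:
        s = S - 2
    if score[T - 1][s] == -math.inf:
        raise ValueError("audio is too short for this transcript")
    path = []
    for t in range(T - 1, -1, -1):
        path.append(s)
        s = back[t][s]
    path.reverse()
    # Odd states are tokens, even states are blanks.
    return [(p - 1) // 2 if p % 2 == 1 else None for p in path]


def word_timestamps(frame_to_token, word_lengths, frame_seconds=0.02):
    """Merge token frames into (start, end) seconds for each word."""
    first, last = {}, {}
    for frame, tok in enumerate(frame_to_token):
        if tok is not None:
            first.setdefault(tok, frame)
            last[tok] = frame
    spans, i = [], 0
    for n in word_lengths:
        start = first[i] * frame_seconds
        end = (last[i + n - 1] + 1) * frame_seconds
        spans.append((round(start, 3), round(end, 3)))
        i += n
    return spans


def peaked(label, vocab_size=5, p=0.9):
    """Toy emission frame that strongly favors one label."""
    rest = math.log((1 - p) / (vocab_size - 1))
    return [math.log(p) if v == label else rest for v in range(vocab_size)]


# Toy vocabulary: 0=blank, 1='h', 2='i', 3='o', 4='k'; transcript "hi ok".
toy_log_probs = [peaked(label) for label in [1, 2, 0, 3, 3, 4]]
frames = forced_align(toy_log_probs, tokens=[1, 2, 3, 4])
words = word_timestamps(frames, word_lengths=[2, 2])
print(words)
```

In a real run the emissions would come from the model rather than from `peaked`, and the token ids would come from the preprocessed (optionally romanized) text.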

not yet live

We're benchmarking and onboarding mms-300m-1130-forced-aligner as a hosted, OpenAI-compatible API.