mms-300m-1130-forced-aligner
MahmoudAshraf/mms-300m-1130-forced-aligner
A popular open forced-alignment model for speech, with 3.2M downloads a month. gigarouter benchmarks and hosts it as an OpenAI-compatible API.
About This Model
Key Strengths
- Efficient forced alignment with significantly lower memory usage than the TorchAudio forced alignment API.
- Supports multilingual text preprocessing with romanization via ISO-639-3 language codes.
- Designed for batch processing of audio emissions to improve throughput.
Best For
- Aligning transcribed text to audio at the word level for applications such as subtitle generation, pronunciation analysis, and audio segmentation.
- Use cases requiring accurate timestamp extraction from speech data with minimal computational overhead.
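As an illustration of the subtitle use case, the sketch below turns word-level timestamps into SRT cues. It assumes the aligner's output has already been reduced to `(text, start_ms, end_ms)` tuples. That tuple shape, the helper names `srt_time` and `words_to_srt`, and the sample timings are hypothetical and are not part of the model or the gigarouter API.

```python
def srt_time(ms):
    """Format integer milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words, max_words=3):
    """Group (text, start_ms, end_ms) tuples into numbered SRT cues.

    Each cue spans from the start of its first word to the end of its last.
    """
    cues = []
    for n, i in enumerate(range(0, len(words), max_words), start=1):
        chunk = words[i:i + max_words]
        text = " ".join(w for w, _, _ in chunk)
        start, end = srt_time(chunk[0][1]), srt_time(chunk[-1][2])
        cues.append(f"{n}\n{start} --> {end}\n{text}\n")
    # Cues are separated by a blank line, as the SRT format expects.
    return "\n".join(cues)


# Hypothetical word timings, for illustration only.
words = [("hello", 480, 900), ("world", 960, 1400), ("again", 3_661_250, 3_661_900)]
srt = words_to_srt(words, max_words=2)
print(srt)
```

Grouping by a fixed word count keeps the sketch short. A real subtitle tool would more likely break cues on pauses or character limits.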
Performance
The model is based on the MMS-300M checkpoint, which is reported to perform well on forced alignment tasks. Converting it to Hugging Face format lets it plug into Hugging Face-based ASR pipelines while retaining the alignment accuracy of the original checkpoint.
Workflow Overview
The model processes audio and text in six steps: load the audio and the alignment model, generate emissions, preprocess the text (with optional romanization), compute alignments, extract spans, and produce word-level timestamps. The gigarouter API handles these steps automatically.
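To make the alignment, span, and timestamp steps concrete, here is a minimal pure-Python sketch of CTC forced alignment on toy data. It is not the model's or the API's implementation. It assumes the blank token has id 0, a 20 ms frame stride, and character-level tokens, and the function names and the hand-built emission matrix are hypothetical.

```python
import math

BLANK = 0      # assumed id of the CTC blank token
FRAME_MS = 20  # assumed frame stride in milliseconds


def viterbi_states(log_probs, tokens):
    """Best path through the CTC states blank, t1, blank, t2, ..., blank.

    Returns one state index per frame; odd index 2*i + 1 means token i.
    """
    states = [BLANK]
    for tok in tokens:
        states += [tok, BLANK]
    T, S = len(log_probs), len(states)
    score = [[-math.inf] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = log_probs[0][states[0]]
    score[0][1] = log_probs[0][states[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [(score[t - 1][s], s)]                   # stay in place
            if s >= 1:
                cands.append((score[t - 1][s - 1], s - 1))   # advance one state
            if s >= 2 and states[s] != BLANK and states[s] != states[s - 2]:
                cands.append((score[t - 1][s - 2], s - 2))   # skip a blank
            best, prev = max(cands)
            score[t][s] = best + log_probs[t][states[s]]
            back[t][s] = prev
    # A valid path ends on the last token or the trailing blank.
    s = S - 1 if score[T - 1][S - 1] >= score[T - 1][S - 2] else S - 2
    path = []
    for t in range(T - 1, -1, -1):
        path.append(s)
        s = back[t][s]
    return path[::-1]


def token_spans(path, n_tokens):
    """First and last frame (inclusive) occupied by each token."""
    spans = []
    for i in range(n_tokens):
        frames = [t for t, s in enumerate(path) if s == 2 * i + 1]
        spans.append((frames[0], frames[-1]))
    return spans


def word_timestamps(words, spans):
    """Merge per-character spans into (word, start_ms, end_ms) tuples."""
    out, k = [], 0
    for word in words:
        first, last = spans[k][0], spans[k + len(word) - 1][1]
        out.append((word, first * FRAME_MS, (last + 1) * FRAME_MS))
        k += len(word)
    return out


# Toy emissions: each frame puts probability 0.9 on one label, 0.025 on the rest.
vocab = {"h": 1, "i": 2, "y": 3, "o": 4}
frame_labels = [0, 1, 1, 2, 0, 0, 3, 4, 4, 0]
log_probs = [[math.log(0.9 if v == lab else 0.025) for v in range(5)]
             for lab in frame_labels]

words = ["hi", "yo"]
tokens = [vocab[c] for w in words for c in w]
path = viterbi_states(log_probs, tokens)
spans = token_spans(path, len(tokens))
word_times = word_timestamps(words, spans)
print(word_times)  # [('hi', 20, 80), ('yo', 120, 180)]
```

In practice the emissions would come from the model rather than a hand-built matrix, and the text would pass through the preprocessing and romanization step before being mapped to token ids.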
We're benchmarking and onboarding mms-300m-1130-forced-aligner as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.