Hosted voice activity detection models

1 model · 0 live as APIs · benchmarked & compared

Voice activity detection (VAD) models distinguish speech from non-speech segments in audio, isolating human voice from silence, noise, and other non-speech sounds. In real-world systems, VAD is used to trim silent portions of voice recordings, gate wake-word detection, reduce bandwidth in telephony, and improve downstream speech recognition accuracy by feeding only relevant audio frames to ASR models. For example, a call center analytics platform can use VAD to locate the speech regions in agent-customer conversations and skip hold music or long pauses. Attributing those regions to individual speakers is a separate diarization step.
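To make the speech/non-speech decision concrete, here is a minimal sketch that uses a simple RMS energy threshold per frame. This is a classical stand-in chosen for illustration, not how a neural model such as pyannote/segmentation-3.0 works; the function names, the 20 ms frame size, and the 0.1 threshold are all assumptions.

```python
import math

def frame_energies(samples, frame_len):
    """Return the RMS energy of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return energies

def detect_speech(samples, sample_rate, frame_ms=20, threshold=0.1):
    """Return (start_sec, end_sec) spans whose frame RMS exceeds threshold.

    frame_ms and threshold are illustrative defaults, not tuned values.
    """
    frame_len = sample_rate * frame_ms // 1000
    energies = frame_energies(samples, frame_len)
    segments = []
    seg_start = None  # index of the first frame of the open segment
    for i, energy in enumerate(energies):
        is_speech = energy > threshold
        if is_speech and seg_start is None:
            seg_start = i
        elif not is_speech and seg_start is not None:
            segments.append((seg_start * frame_ms / 1000, i * frame_ms / 1000))
            seg_start = None
    if seg_start is not None:
        # Audio ended while a segment was still open.
        segments.append((seg_start * frame_ms / 1000,
                         len(energies) * frame_ms / 1000))
    return segments

# Synthetic input: 0.5 s silence, 0.5 s of a 440 Hz tone, 0.5 s silence.
rate = 8000
silence = [0.0] * (rate // 2)
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate // 2)]
segments = detect_speech(silence + tone + silence, rate)
print(segments)
```

An energy threshold like this treats any loud sound as speech, including hold music, which is one reason production systems tend to use trained models instead.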

In production, VAD models typically run as a preprocessing step before more expensive speech tasks. They are deployed as a lightweight filter on streaming audio, often operating in real-time with low latency. A common architecture places VAD at the edge or in a media server, forwarding only speech segments to a cloud-based speech-to-text service.
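The filtering step described above can be sketched as a small streaming gate. The sketch assumes some per-frame speech classifier is available (passed in as `is_speech`, a hypothetical callable) and adds two common safeguards: a short pre-roll buffer so the onset of an utterance is not clipped, and a hangover period so brief pauses do not cut a segment short. The parameter names and default values are illustrative.

```python
from collections import deque

def gate_speech(frames, is_speech, pre_roll=2, hangover=3):
    """Yield only the frames worth forwarding to a downstream recognizer.

    frames:    iterable of audio frames (any objects)
    is_speech: callable frame -> bool, a stand-in for a real VAD model
    pre_roll:  non-speech frames kept as context before a speech onset
    hangover:  non-speech frames still forwarded after speech ends
    """
    history = deque(maxlen=pre_roll)  # most recent non-forwarded frames
    remaining = 0                     # hangover frames left to forward
    for frame in frames:
        if is_speech(frame):
            # Flush buffered context first so the onset is not clipped.
            while history:
                yield history.popleft()
            yield frame
            remaining = hangover
        elif remaining > 0:
            yield frame
            remaining -= 1
        else:
            history.append(frame)

# Usage: frames are numbered 0..11 and only frames 5 and 6 contain speech.
speech_frames = {5, 6}
forwarded = list(gate_speech(range(12), lambda f: f in speech_frames))
print(forwarded)
```

Because the gate is a generator, it processes one frame at a time and holds at most `pre_roll` frames in memory, which suits edge or media-server deployment.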

Choosing between VAD models involves balancing latency, accuracy, and model size. Smaller models offer faster inference and lower compute cost but may miss short utterances or misclassify background noise as speech. Larger models like pyannote/segmentation-3.0 provide higher robustness across diverse acoustic conditions at the expense of higher memory use and latency. For many workloads, calling a hosted API removes the overhead of model deployment, scaling, and hardware maintenance, allowing teams to focus on their core application.
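The two failure modes named above, missed speech and noise classified as speech, can be measured per frame when comparing candidate models. The following is a minimal sketch assuming you have reference and predicted labels as equal-length sequences of 0/1 values; the function name and the sample labels are made up for illustration.

```python
def vad_error_rates(reference, predicted):
    """Return (miss_rate, false_alarm_rate) from per-frame 0/1 labels.

    miss_rate:        fraction of reference speech frames predicted non-speech
    false_alarm_rate: fraction of reference non-speech frames predicted speech
    """
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have the same length")
    speech = sum(1 for r in reference if r)
    nonspeech = len(reference) - speech
    missed = sum(1 for r, p in zip(reference, predicted) if r and not p)
    false_alarms = sum(1 for r, p in zip(reference, predicted) if p and not r)
    miss_rate = missed / speech if speech else 0.0
    false_alarm_rate = false_alarms / nonspeech if nonspeech else 0.0
    return miss_rate, false_alarm_rate

# Illustrative labels: the prediction drops a one-frame utterance (index 8)
# and flags one noise frame as speech (index 1).
reference = [0, 0, 1, 1, 1, 1, 0, 0, 1, 0]
predicted = [0, 1, 1, 1, 1, 1, 0, 0, 0, 0]
miss_rate, false_alarm_rate = vad_error_rates(reference, predicted)
print(miss_rate, false_alarm_rate)
```

Which rate matters more depends on the application: a missed frame loses words before transcription, while a false alarm mainly costs downstream compute.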

compare

model	params	downloads/mo	price	status
pyannote/segmentation-3.0	-	6.5M	at launch	coming soon
