skip to content
gigarouter gigarouter

Segmentation 3.0

pyannote/segmentation-3.0

published Sep 2023 · updated May 2024

Segmentation 3.0 is a voice-activity-detection model that outputs speaker diarization as a powerset multi-class matrix for 10-second audio chunks.

status
coming soon
API providers
0
downloads / mo
6.5M
license
mit

specs

TaskVoice Activity Detection, Speaker Segmentation
ArchitecturePowerset multi-class neural network with pyannote.audio
Input10 seconds mono audio at 16kHz
Output(num_frames, 7) matrix with non-speech and speaker combinations
LicenseMIT

about this model

pyannote/segmentation-3.0 is a voice activity detection model that processes 10-second mono audio chunks at 16 kHz and outputs a (num_frames, 7) powerset multi-class encoding, where the seven classes represent non-speech, individual speakers (1–3), and overlapping speaker pairs (1+2, 1+3, 2+3). This enables both voice activity detection (VAD) and overlapped speech detection directly from the same output.

Key strengths

Trained on a large, diverse combination of nine datasets: AISHELL, AliMeeting, AMI, AVA-AVD, DIHARD, Ego4D, MSDWild, REPERE, and VoxConverse. The model is the backbone of the pyannote/speaker-diarization-3.0 pipeline, which achieves state-of-the-art diarization error rates (DER) across 12 benchmark datasets:

DatasetDER (%)
AISHELL-412.2
AliMeeting (ch1)24.5
AMI IHM18.8
AMI SDM22.7
AVA-AVD49.7
CALLHOME part 228.5
DIHARD 3 full21.4
Ego4D dev51.2
MSDWild25.4
RAMC22.2
REPERE phase27.9
VoxConverse v0.311.2

Inference is efficient: the full speaker diarization pipeline (including clustering) processes a one-hour recording in approximately 1.5 minutes on a single Nvidia Tesla V100 GPU (real-time factor ~2.5%).

Example output showing waveform, powerset multi-class encoding, and multi-label encoding

The model is released under the MIT license and is described in detail in the INTERSPEECH 2023 paper “Powerset multi-class cross entropy loss for neural speaker diarization” (Plaquet & Bredin).

best for

FAQ

What is the input format for this model?

It accepts 10 seconds of mono audio sampled at 16kHz, typically as a waveform tensor.

How does this model compare to a standard VAD?

It uses a powerset multi-class approach to also detect overlapping speakers, not just speech/non-speech.

What license is Segmentation 3.0 released under?

It is licensed under the MIT license.

Can I use this model for full recording diarization?

Not directly; it processes only 10-second chunks. Use the pyannote/speaker-diarization-3.0 pipeline for full recordings.

How do I call this model via the gigarouter API?

Use the OpenAI-compatible endpoint with your gigarouter API key and send audio data as specified in the documentation.

not yet live

We're benchmarking and onboarding Segmentation 3.0 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.