
whisper : add speaker diarization support #3732

Open
MoonMao42 wants to merge 5 commits into ggml-org:master from MoonMao42:speaker-diarization

Conversation

@MoonMao42

Add speaker diarization based on ECAPA-TDNN speaker embeddings.

When enabled via --diarize, each transcription segment gets assigned a
speaker ID. The pipeline works by computing a 192-dim speaker embedding
per segment using a ported SpeechBrain ECAPA-TDNN model, then clustering
them with agglomerative hierarchical clustering.
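The clustering step can be sketched as a toy average-linkage AHC over cosine distances (this is an illustrative Python version, not the C++ from this PR; the 0.70 stopping threshold matches the approach described in the benchmark comment below):

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def cluster_speakers(embeddings, threshold=0.70):
    # start with each segment in its own cluster
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > 1:
        # find the closest pair under average linkage
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sum(cosine_distance(embeddings[a], embeddings[b])
                        for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break  # remaining clusters are distinct speakers
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    # map segment index -> speaker id
    labels = [0] * len(embeddings)
    for spk, members in enumerate(clusters):
        for m in members:
            labels[m] = spk
    return labels
```

The O(n²) pairwise scan per merge is also where the "O(n²) clustering" limitation noted below comes from.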

New files:

  • src/whisper-diarize.cpp/h: mel computation, ECAPA-TDNN forward pass, clustering
  • src/whisper-speaker.cpp/h: GGML model loader
  • models/convert-speaker-to-ggml.py: SpeechBrain model converter

Usage:

python models/convert-speaker-to-ggml.py --output models/ggml-speaker-ecapa-tdnn.bin
./whisper-cli -m models/ggml-base.en.bin \
  --diarize --diarize-model models/ggml-speaker-ecapa-tdnn.bin -f input.wav

The feature is compile-gated behind WHISPER_DIARIZE and has zero overhead
when disabled. Embeddings match SpeechBrain PyTorch output (cosine distance
< 0.05).
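The parity check against SpeechBrain boils down to a cosine distance between the two 192-dim vectors; a minimal sketch (the embedding names here are placeholders, not APIs from this PR):

```python
import math

def cosine_distance(a, b):
    """1 - cos(a, b): 0 means identical direction, 2 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def embeddings_match(ggml_emb, torch_emb, tol=0.05):
    # parity criterion from the PR description: cosine distance < 0.05
    return cosine_distance(ggml_emb, torch_emb) < tol
```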

Known limitations: ~200MB memory per encoder context, no GPU backend,
O(n²) clustering.

Resolves: #64

@thewh1teagle
Contributor

I recommend creating a synthetic dataset with multiple speakers, using text-to-speech models, to benchmark locally and verify the method works.

@MoonMao42
Author

Good idea. I tested with real multi-speaker audio and the embeddings discriminate well (cosine distance >0.7 across speakers, <0.3 same speaker), but a reproducible TTS benchmark would be nice to have. Will look into it.

whisper_clustering_context_create() overwrites the old pointer without
freeing it first. When the same whisper_state is reused across multiple
inference runs, the previous clustering context leaks. Free it before
creating a new one.
- cli/server: add --diarize-model, --diarize-threshold, --diarize-speakers
- unify speaker label logic across all output formats (txt/vtt/srt/csv/json/lrc/wts)
- fall back to stereo diarization when no model is provided
- fix memory leak in whisper_compute_mel_80, move allocs out of hot loop
- thread-safe static init with std::call_once
- rename hann → hamming (was actually hamming), remove dead code
- dynamic ggml context sizing, WHISPER_LOG_* macros in speaker loader
- fix n_channels 512 → 1024 in python converter
- server: ARGV_NEXT bounds checking for all args
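For reference, the Hamming window from the rename above differs from Hann only in its coefficients (0.54/0.46 vs 0.5/0.5); a Python sketch of the symmetric variant (whether the PR uses the symmetric or periodic form isn't shown here):

```python
import math

def hamming_window(n):
    # w[i] = 0.54 - 0.46 * cos(2*pi*i / (n - 1)); Hann would use 0.5 / 0.5,
    # which reaches exactly zero at the endpoints while Hamming stops at 0.08
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]
```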
@MoonMao42 MoonMao42 marked this pull request as ready for review April 3, 2026 16:16
@HDJohnbot

PM note: this PR is active and conceptually in review/in-progress, not backlog. The current thread is discussing a reproducible benchmark approach for multi-speaker testing, and the author said they will look into it. Next move is to produce that benchmark/repro so review can continue cleanly. — little John

@MoonMao42
Author

MoonMao42 commented Apr 3, 2026

Diarization Benchmark (VoxConverse dev subset)

Ran a quick benchmark against pyannote.audio 3.1 on 8 files from VoxConverse dev set (2-5 speakers, 68-664s). Apple M3, 16GB.

Results

| File  | Spks | Dur   | W.cpp DER | W.cpp t | Pyan DER | Pyan t |
|-------|------|-------|-----------|---------|----------|--------|
| akthc | 2    | 115s  | 5.9%      | 14.1s   | 2.9%     | 60.3s  |
| bkwns | 2    | 68s   | 37.6%     | 7.1s    | 0.1%     | 34.7s  |
| ampme | 3    | 148s  | 10.3%     | 15.7s   | 7.6%     | 93.2s  |
| asxwr | 3    | 238s  | 2.0%      | 26.2s   | 0.8%     | 144.5s |
| ahnss | 4    | 664s  | 10.9%     | 80.7s   | 3.2%     | 406.2s |
| afjiv | 5    | 151s  | 50.7%     | 14.6s   | 5.5%     | 89.5s  |
| bauzd | 5    | 500s  | 13.4%     | 59.0s   | 6.1%     | 305.5s |
| bxpwa | 5    | 426s  | 4.1%      | 47.4s   | 2.1%     | 256.2s |
| AVG   |      | 2310s | 16.9%     | 33.1s   | 3.5%     | 173.8s |

whisper.cpp: RTF=0.11 (265s for 2310s audio)
pyannote 3.1: RTF=0.60 (1390s for 2310s audio)

~5.2x faster, single binary, ~200MB vs ~3GB memory.

Approach: 2s sliding window embeddings with 1s hop, energy-based silence filtering, agglomerative clustering (average linkage, cosine distance threshold 0.70), token-level speaker assignment with majority voting.
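The token-level assignment described above can be sketched as follows (window/hop values from the approach; the midpoint-time representation and tie-breaking behavior are illustrative, not the PR's exact logic):

```python
from collections import Counter

def assign_token_speakers(token_times, window_labels, window_len=2.0, hop=1.0):
    """Assign each token the majority speaker among windows covering it.

    token_times:   list of token midpoint times in seconds
    window_labels: speaker id per sliding window; window k spans
                   [k*hop, k*hop + window_len)
    Returns -1 for tokens covered by no window.
    """
    out = []
    for t in token_times:
        votes = [lab for k, lab in enumerate(window_labels)
                 if k * hop <= t < k * hop + window_len]
        out.append(Counter(votes).most_common(1)[0][0] if votes else -1)
    return out
```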

Works well on 2-4 speaker scenarios (asxwr 2.0%, bxpwa 4.1%, akthc 5.9%). Main weakness is dense multi-speaker audio with similar voices (afjiv, 5 speakers). bkwns has a speaker with only 2.5s of speech — hard for any embedding-based approach.

Eval setup: collar=0.25s, skip_overlap=False, pyannote.metrics.
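The reported numbers come from pyannote.metrics; purely to illustrate what DER measures, here is a toy frame-level version (it assumes speaker ids are already optimally mapped between reference and hypothesis, and ignores the collar and overlap handling of the real metric):

```python
def frame_der(ref, hyp):
    """Toy frame-level DER: (miss/confusion + false alarm) / ref speech frames.

    ref/hyp are equal-length lists of per-frame speaker ids, -1 = silence.
    Real DER additionally finds the optimal speaker mapping and applies a
    collar around reference boundaries.
    """
    speech = [i for i, r in enumerate(ref) if r != -1]
    miss_or_conf = sum(1 for i in speech if hyp[i] != ref[i])
    false_alarm = sum(1 for i, r in enumerate(ref) if r == -1 and hyp[i] != -1)
    return (miss_or_conf + false_alarm) / len(speech)
```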

The ggml context pool could run out of space for some segment
lengths where the estimate was a few MB short. Add 10% margin
to the allocation.
@MoonMao42
Author

Validation set (9 files outside original subset, 2–10 speakers)

| File  | Spks | Dur   | W.cpp DER | W.cpp t | Pyan DER | Pyan t |
|-------|------|-------|-----------|---------|----------|--------|
| cobal | 2    | 76s   | 0.0%      | 8.1s    | 0.1%     | 35.4s  |
| aufkn | 3    | 181s  | 8.8%      | 18.8s   | 6.0%     | 91.0s  |
| dhorc | 4    | 303s  | 6.6%      | 30.8s   | 0.8%     | 176.2s |
| ccokr | 5    | 201s  | 18.7%     | 27.3s   | 6.9%     | 128.6s |
| edixl | 6    | 312s  | 14.1%     | 35.9s   | 1.0%     | 178.6s |
| fsaal | 7    | 198s  | 5.0%      | 24.3s   | 4.5%     | 125.5s |
| eziem | 8    | 178s  | 19.2%     | 19.2s   | 7.0%     | 613.6s |
| sqkup | 9    | 131s  | 38.4%     | 17.7s   | 30.8%    | 75.2s  |
| xmfzh | 10   | 181s  | 20.4%     | 20.7s   | 3.8%     | 117.7s |
| AVG   |      | 1762s | 14.6%     | 22.5s   | 6.8%     | 171.3s |

whisper.cpp: RTF=0.12 (203s for 1762s audio)
pyannote 3.1: RTF=0.87 (1542s for 1762s audio)

@MoonMao42
Author

DER improvement exploration

Spent some time trying to push DER lower:

Tried: Silero VAD replacing energy-based silence filtering
Swapped the RMS energy threshold (< 0.01) in the diarization window loop with Silero VAD speech probabilities. No measurable difference — whisper's ASR segments are already speech, so the energy check was rarely triggering anyway. Added ~3s overhead per file for model loading. Reverted.
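The energy gate that Silero VAD was tested against is roughly the following (the 0.01 RMS threshold is from the comment above; float PCM samples in [-1, 1] are assumed):

```python
import math

def is_speech(samples, rms_threshold=0.01):
    # root-mean-square energy over the window; below threshold = treat as silence
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms >= rms_threshold
```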

Not pursued (out of scope):

  • Spectral clustering (replacing AHC) — papers show ~20-40% relative DER reduction, but needs an eigensolver in C from scratch, 300+ lines
  • Overlap-aware diarization — requires a separate overlap detection model
  • Multi-scale embedding windows (1s/2s/3s averaging) — likely marginal, doesn't fix the clustering bottleneck
  • VBx post-processing

Current implementation works well for typical use cases (2-5 speakers, clear speech). Tightening DER further would be a separate effort.

Successfully merging this pull request may close these issues.

whisper : mark speakers/voices (diarization)
3 participants