whisper : add speaker diarization support #3732
MoonMao42 wants to merge 5 commits into ggml-org:master
Conversation
Add speaker diarization based on ECAPA-TDNN speaker embeddings.
When enabled via --diarize, each transcription segment gets assigned a
speaker ID. The pipeline works by computing a 192-dim speaker embedding
per segment using a ported SpeechBrain ECAPA-TDNN model, then clustering
them with agglomerative hierarchical clustering.
New files:
- src/whisper-diarize.cpp/h: mel computation, ECAPA-TDNN forward pass, clustering
- src/whisper-speaker.cpp/h: GGML model loader
- models/convert-speaker-to-ggml.py: SpeechBrain model converter
Usage:
python models/convert-speaker-to-ggml.py --output models/ggml-speaker-ecapa-tdnn.bin
./whisper-cli -m models/ggml-base.en.bin \
--diarize --diarize-model models/ggml-speaker-ecapa-tdnn.bin -f input.wav
The feature is compile-gated behind WHISPER_DIARIZE and has zero overhead
when disabled. Embeddings match SpeechBrain PyTorch output (cosine distance
< 0.05).
Known limitations: ~200MB memory per encoder context, no GPU backend,
O(n^2) clustering.
Resolves: ggml-org#64
I recommend creating a synthetic dataset with multiple speakers using text-to-speech models, to benchmark locally and verify that the method works.
Good idea. I tested with real multi-speaker audio and the embeddings discriminate well (cosine distance > 0.7 across speakers, < 0.3 within the same speaker), but a reproducible TTS benchmark would be nice to have. Will look into it.
whisper_clustering_context_create() overwrites the old pointer without freeing it first. When the same whisper_state is reused across multiple inference runs, the previous clustering context leaks. Free it before creating a new one.
- cli/server: add --diarize-model, --diarize-threshold, --diarize-speakers
- unify speaker label logic across all output formats (txt/vtt/srt/csv/json/lrc/wts)
- fall back to stereo diarization when no model is provided
- fix memory leak in whisper_compute_mel_80, move allocations out of the hot loop
- thread-safe static init with std::call_once
- rename hann → hamming (the window was actually Hamming), remove dead code
- dynamic ggml context sizing, WHISPER_LOG_* macros in the speaker loader
- fix n_channels 512 → 1024 in the Python converter
- server: ARGV_NEXT bounds checking for all args
PM note: this PR is active and conceptually in review, not backlog. The current thread is discussing a reproducible benchmark approach for multi-speaker testing, and the author said they will look into it. The next step is to produce that benchmark so review can continue cleanly. — little John
Diarization benchmark (VoxConverse dev subset)

Ran a quick benchmark against pyannote.audio 3.1 on 8 files from the VoxConverse dev set (2-5 speakers, 68-664 s). Apple M3, 16 GB.

Results: whisper.cpp: RTF=0.11 (265 s for 2310 s of audio), ~5.2x faster, single binary, ~200 MB vs ~3 GB memory.

Approach: 2 s sliding-window embeddings with a 1 s hop, energy-based silence filtering, agglomerative clustering (average linkage, cosine distance threshold 0.70), token-level speaker assignment with majority voting.

Works well on 2-4 speaker scenarios (asxwr 2.0%, bxpwa 4.1%, akthc 5.9%). Main weakness is dense multi-speaker audio with similar voices (afjiv, 5 speakers). bkwns has a speaker with only 2.5 s of speech, which is hard for any embedding-based approach.

Eval setup: collar=0.25 s, skip_overlap=False, pyannote.metrics.
The ggml context pool could run out of space for some segment lengths where the size estimate was a few MB short. Add a 10% margin to the allocation.
Validation set (9 files outside the original subset, 2–10 speakers)

whisper.cpp: RTF=0.12 (203 s for 1762 s of audio)
DER improvement exploration

Spent some time trying to push DER lower.

Tried: Silero VAD replacing energy-based silence filtering.

Not pursued (out of scope):

The current implementation works well for typical use cases (2-5 speakers, clear speech). Tightening DER further would be a separate effort.