quant: warn when quantizing Gemma 4 below Q5_K_M for audio #21599
stephencox-ict wants to merge 1 commit
Hi @stephencox-ict, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
This makes me think that you have some other issue in your audio implementation -
Possible. I ran so many tests that I have lost track of which is which. I am rerunning them now.
Updated test results with upstream fixes
After testing with the fixes from #21421 (BF16-rounded scales, ggml_cont for sigmoid, conv norm swap) and #21625 (per-layer embedding scaling), the quantization picture has changed significantly:
With #21421 + #21625 applied (test-2.mp3, ~17s,
Q8_0 and Q5_K_M now produce correct transcriptions without the forced Q6_K embedding fix from this PR. The root causes of the earlier failures were:
This PR's forced Q6_K embedding approach may still provide an additional safety margin, but it is no longer required for Q8_0 or Q5_K_M. The minimum viable quantization for audio is now Q5_K_M (with #21421 + #21625 merged). Q4_K_M still fails consistently. I recommend narrowing this PR's scope to adding only the warning for Q4_K_M and below, rather than forcing Q6_K on all quants.
Gemma4 audio transcription produces repetitive output on longer audio (17s+) when quantized to Q4_K_M or below. Q5_K_M and above produce correct transcriptions when combined with the audio encoder fixes from PR ggml-org#21421 and per-layer scaling from PR ggml-org#21625.

Add a warning when quantizing Gemma4 below Q5_K_M to inform users that audio quality may be degraded.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
6a5edd4 to fcad6c8
Closing. The comprehensive validation from #21421 shows that Q4_K_M and below work correctly for audio transcription:
The Q4_K_M "repetition" on long audio reported here was caused by the model's thinking block consuming all tokens before outputting the transcription (TRUNC), not by quantization-induced transcription failure. With higher
The warning is no longer needed.
Overview
Add a warning when quantizing Gemma 4 models below Q5_K_M, informing users that audio transcription quality may be degraded.
With the audio encoder fixes from #21421 (BF16-rounded scales, ggml_cont for sigmoid, conv norm swap) and per-layer scaling from #21625, the minimum viable quantization for Gemma 4 audio is Q5_K_M. Q4_K_M and below produce repetitive output on longer audio (~17s+) across all backends.
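The check this PR proposes can be pictured with a minimal sketch. This is not the PR's actual diff: the `quant_level` enum, its ordering, the `"gemma4"` architecture string, the function names, and the warning text are all hypothetical stand-ins chosen for illustration, and the real quantizer uses its own types.

```cpp
#include <cstdio>
#include <string>

// Hypothetical ordering of quantization types, lowest precision first.
// This is NOT the project's real file-type enum; it only illustrates the idea.
enum class quant_level { Q4_0 = 0, Q4_K_M, Q5_K_M, Q6_K, Q8_0 };

// Threshold taken from the PR description: warn for anything below Q5_K_M.
constexpr quant_level k_min_audio_quant = quant_level::Q5_K_M;

// Returns true when the (assumed) architecture name is Gemma 4 and the
// requested quantization is below the threshold.
bool should_warn_audio(const std::string & arch, quant_level target) {
    return arch == "gemma4" && target < k_min_audio_quant;
}

// Prints an illustrative warning to stderr and reports whether it fired.
// Quantization itself is never blocked; the user is only informed.
bool warn_if_low_quant(const std::string & arch, quant_level target) {
    if (!should_warn_audio(arch, target)) {
        return false;
    }
    std::fprintf(stderr,
        "warning: quantizing Gemma 4 below Q5_K_M; "
        "audio transcription quality may be degraded\n");
    return true;
}
```

Under these assumptions, calling `warn_if_low_quant("gemma4", quant_level::Q4_K_M)` would print the warning, while Q5_K_M and above, or any other architecture, would stay silent.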
Test results (with #21421 + #21625 applied, tools/mtmd/test-2.mp3):
Changes:
Additional information
Depends on:
Requirements