quant: warn when quantizing Gemma 4 below Q5_K_M for audio #21599

Closed
stephencox-ict wants to merge 1 commit into ggml-org:master from stephencox-ict:gemma4-quant-fix

Conversation

@stephencox-ict

@stephencox-ict stephencox-ict commented Apr 8, 2026

Contributor

Overview

Add a warning when quantizing Gemma 4 models below Q5_K_M, informing users that audio transcription quality may be degraded.

With the audio encoder fixes from #21421 (BF16-rounded scales, ggml_cont for sigmoid, conv norm swap) and per-layer scaling from #21625, the minimum viable quantization for Gemma 4 audio is Q5_K_M. Q4_K_M and below produce repetitive output on longer audio (~17s+) across all backends.

Test results (with #21421 + #21625 applied, tools/mtmd/test-2.mp3):

| Quant  | CPU           | CUDA (RTX 3060) |
|--------|---------------|-----------------|
| BF16   | ✅ Pass       | ✅ Pass         |
| Q8_0   | ✅ Pass       | ✅ Pass         |
| Q6_K   | ✅ Pass       | ✅ Pass         |
| Q5_K_M | ✅ Pass       | ✅ Pass         |
| Q4_K_M | ❌ Repetition | ❌ Repetition   |

Changes:

  • Warning printed when quantizing Gemma4 to Q4_K_M or below
  • No forced type overrides (the previous forced Q6_K on embeddings is no longer needed)
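
A minimal sketch of the kind of check described in the changes above. It is not the code from this PR: the `quant_level` enum, the `"gemma4"` architecture string, and both function names are hypothetical, and ranking quantization types on a single linear scale is a simplification made only for illustration.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical quantization levels, ordered from most to least aggressive.
// The labels follow the test table, but this is NOT llama.cpp's own enum.
enum class quant_level {
    Q3_K_S = 0,
    Q4_0,
    Q4_K_M,
    Q5_K_M,
    Q6_K,
    Q8_0,
    BF16,
};

// True when the target sits below the Q5_K_M threshold named in this PR.
bool below_audio_threshold(quant_level target) {
    return static_cast<int>(target) < static_cast<int>(quant_level::Q5_K_M);
}

// Prints a warning and returns true when a Gemma 4 model is quantized below
// the threshold. The architecture string "gemma4" is an assumption.
// No tensor types are overridden; the function only reports.
bool maybe_warn_gemma4_audio(const std::string & arch, quant_level target) {
    if (arch != "gemma4" || !below_audio_threshold(target)) {
        return false;
    }
    std::fprintf(stderr,
        "warning: quantizing Gemma 4 below Q5_K_M may degrade audio transcription quality\n");
    return true;
}
```

With these definitions, `maybe_warn_gemma4_audio("gemma4", quant_level::Q4_K_M)` prints the warning, while Q5_K_M and above, or any other architecture, stay silent.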

Additional information

Depends on:

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES - Claude Code was used to investigate quantization sensitivity and update this PR based on new test results. All code was manually reviewed.

@ggml-gh-bot

ggml-gh-bot Bot commented Apr 8, 2026


Hi @stephencox-ict, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 2 open PRs.

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.


Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@ggerganov

Member

> Q8_0 | ❌ Repetition on long audio

This makes me think that you have some other issue in your audio implementation - Q8_0 should always work.

@stephencox-ict

Contributor Author

> Q8_0 | ❌ Repetition on long audio
>
> This makes me think that you have some other issue in your audio implementation - Q8_0 should always work.

Possible. I ran so many tests that I lost track of which result came from which configuration. I am re-running them now.

@stephencox-ict stephencox-ict marked this pull request as draft April 8, 2026 11:09
@stephencox-ict

Contributor Author

Updated test results with upstream fixes

After testing with the fixes from #21421 (BF16-rounded scales, ggml_cont for sigmoid, conv norm swap) and #21625 (per-layer embedding scaling), the quantization picture has changed significantly:

With #21421 + #21625 applied (test-2.mp3, ~17s, --temp 1.0 --top-k 64 --top-p 0.95):

| Quant  | CPU Result    |
|--------|---------------|
| BF16   | ✅ Pass       |
| Q8_0   | ✅ Pass       |
| Q6_K   | ✅ Pass       |
| Q5_K_M | ✅ Pass       |
| Q4_K_M | ❌ Repetition |

Q8_0 and Q5_K_M now produce correct transcriptions without the forced Q6_K embedding fix from this PR. The root causes of the earlier failures were:

  1. Non-contiguous sigmoid input causing 25 graph splits on GPU backends (fixed by ggml_cont in #21421, "mtmd: add Gemma 4 audio conformer encoder support")
  2. Missing BF16-rounded scale factors that compound through 35 layers (fixed in #21421)
  3. Missing per-layer embedding scaling for the multimodal path (fixed in #21625, "model: fix multimodal padding token for gemma3n/gemma4")
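
To illustrate the second root cause, here is a hedged sketch of rounding a float scale factor to BF16 precision and estimating how far a small per-layer mismatch could drift over 35 layers. The function names, the example scale of 0.1, and the assumption that the mismatch multiplies once per layer are illustrative only; they are not taken from #21421.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Round a float to the nearest BF16-representable value (round-to-nearest-even)
// and return it widened back to float. BF16 keeps the top 16 bits of an
// IEEE-754 binary32 value.
float round_to_bf16(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    if ((bits & 0x7fffffffu) > 0x7f800000u) {
        // NaN: force a quiet NaN and drop the low 16 bits
        bits = (bits | 0x00400000u) & 0xffff0000u;
    } else {
        // add half of the dropped range, plus the tie-breaking bit, then truncate
        bits += 0x7fffu + ((bits >> 16) & 1u);
        bits &= 0xffff0000u;
    }
    float out;
    std::memcpy(&out, &bits, sizeof(out));
    return out;
}

// Ratio between using the BF16-rounded scale and the unrounded scale after
// applying it once in each of n_layers layers (an assumed, simplified model).
double compounded_drift(float scale, int n_layers) {
    const double ratio = static_cast<double>(round_to_bf16(scale)) / static_cast<double>(scale);
    return std::pow(ratio, n_layers);
}
```

Under these assumptions, a scale of 0.1 rounds to 0.10009765625 in BF16, a difference of roughly 0.1% that grows to about 3.5% after 35 multiplications. This is one plausible way for an encoder that skips the rounding to diverge from reference behavior.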

This PR's forced Q6_K embedding approach may still provide an additional safety margin, but it's no longer required for Q8_0 or Q5_K_M. The minimum viable quantization for audio is now Q5_K_M (with #21421 + #21625 merged). Q4_K_M still fails consistently.

Recommend updating this PR's scope to only add the warning for Q4_K_M and below, rather than forcing Q6_K on all quants.

Gemma4 audio transcription produces repetitive output on longer audio
(17s+) when quantized to Q4_K_M or below. Q5_K_M and above produce
correct transcriptions when combined with the audio encoder fixes
from PR ggml-org#21421 and per-layer scaling from PR ggml-org#21625.

Add a warning when quantizing Gemma4 below Q5_K_M to inform users
that audio quality may be degraded.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@stephencox-ict stephencox-ict changed the title from "quant: force Q6_K minimum for Gemma4 tied embeddings" to "quant: warn when quantizing Gemma4 below Q5_K_M" Apr 9, 2026
@stephencox-ict stephencox-ict changed the title from "quant: warn when quantizing Gemma4 below Q5_K_M" to "quant: warn when quantizing Gemma 4 below Q5_K_M for audio" Apr 9, 2026
@stephencox-ict stephencox-ict marked this pull request as ready for review April 9, 2026 00:24
@stephencox-ict

Contributor Author

Closing — the comprehensive validation from #21421 shows Q4_K_M and below work correctly for audio transcription:

  • E2B short audio: 14/14 PASS (all quants including Q3_K_S)
  • E2B Q4_0 long audio: PASS
  • E2B IQ4_NL long audio: PASS
  • E4B Q4_K_M: PASS (both short and long)
  • E4B Q3_K_S: PASS (both short and long)

The Q4_K_M "repetition" on long audio reported here was caused by the model's thinking block consuming all tokens before outputting the transcription (TRUNC), not by quantization-induced transcription failure. With higher -n values, Q4_K_M transcribes correctly.
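
The distinction drawn above, between a run truncated by the token budget and a genuine transcription failure, could be checked along these lines. This is a sketch under stated assumptions: the `run_outcome` labels, the function name, and the thinking-block end marker passed as `think_end` are hypothetical and do not come from the test scripts used in #21421.

```cpp
#include <cassert>
#include <string>

enum class run_outcome { PASS, TRUNC, NO_ANSWER };

// Classify one generation run.
//   output      : full generated text, including any thinking block
//   n_generated : number of tokens actually produced
//   n_predict   : token budget for the run (the -n value)
//   think_end   : hypothetical marker that closes the thinking block
run_outcome classify_run(const std::string & output, int n_generated, int n_predict,
                         const std::string & think_end) {
    const bool hit_budget = n_generated >= n_predict;
    const std::string::size_type pos = output.find(think_end);

    std::string answer;
    if (pos != std::string::npos) {
        // everything after the thinking block is the candidate transcription
        answer = output.substr(pos + think_end.size());
    } else if (hit_budget) {
        // budget exhausted before the thinking block closed
        return run_outcome::TRUNC;
    } else {
        // no thinking block at all: treat the whole output as the answer
        answer = output;
    }

    if (answer.find_first_not_of(" \t\r\n") != std::string::npos) {
        return run_outcome::PASS;
    }
    return hit_budget ? run_outcome::TRUNC : run_outcome::NO_ANSWER;
}
```

A run classified as TRUNC is a candidate for re-testing with a larger `-n` before any conclusion is drawn about the quantization level.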

The warning is no longer needed.

@stephencox-ict stephencox-ict deleted the gemma4-quant-fix branch April 10, 2026 09:25

3 participants