quant: warn when quantizing Gemma 4 below Q5_K_M for audio by stephencox-ict · Pull Request #21599 · ggml-org/llama.cpp

stephencox-ict · 2026-04-08T01:36:05Z

Overview

Add a warning when quantizing Gemma 4 models below Q5_K_M, informing users that audio transcription quality may be degraded.

With the audio encoder fixes from #21421 (BF16-rounded scales, ggml_cont for sigmoid, conv norm swap) and per-layer scaling from #21625, the minimum viable quantization for Gemma 4 audio is Q5_K_M. Q4_K_M and below produce repetitive output on longer audio (~17s+) across all backends.

Test results (with #21421 + #21625 applied, tools/mtmd/test-2.mp3):

Quant	CPU	CUDA (RTX 3060)
BF16	✅ Pass	✅ Pass
Q8_0	✅ Pass	✅ Pass
Q6_K	✅ Pass	✅ Pass
Q5_K_M	✅ Pass	✅ Pass
Q4_K_M	❌ Repetition	❌ Repetition

Changes:

Print a warning when quantizing Gemma 4 to Q4_K_M or below
No forced type overrides (the Q6_K override this PR previously forced on the embeddings is no longer needed)
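
As an illustration of the behaviour described under "Changes", the check could look roughly like the sketch below. It is written in Python for brevity and is not the code from this PR: the architecture string `"gemma4"`, the function names, and the set of low-precision types are assumptions made for the example, and the set only lists types that are named in this thread.

```python
import sys

# Illustrative, non-exhaustive set of quantization types below Q5_K_M.
# Only types mentioned in this thread are listed; this is NOT llama.cpp's enum.
BELOW_Q5_K_M = {"Q4_K_M", "Q4_0", "IQ4_NL", "Q3_K_S"}


def audio_quant_warning(arch: str, quant: str):
    """Return a warning string for low-precision Gemma 4 quants, else None."""
    if arch != "gemma4":  # hypothetical architecture identifier
        return None
    if quant in BELOW_Q5_K_M:
        return (
            f"warning: quantizing Gemma 4 to {quant} (below Q5_K_M) "
            "may degrade audio transcription quality"
        )
    return None


def choose_quant(arch: str, requested: str) -> str:
    """Warn if needed, but never override the requested type."""
    msg = audio_quant_warning(arch, requested)
    if msg is not None:
        print(msg, file=sys.stderr)
    return requested  # no forced override, matching the second bullet
```

The point of the sketch is that the warning is advisory only: the requested type is returned unchanged, so no tensor is silently promoted to a larger type.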

Additional information

Depends on:

mtmd: add Gemma 4 audio conformer encoder support #21421 (Gemma4 audio encoder)
model: fix multimodal padding token for gemma3n/gemma4 #21625 (per-layer embedding scale for multimodal path)

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES - Claude Code was used to investigate quantization sensitivity and update this PR based on new test results. All code was manually reviewed.

ggml-gh-bot · 2026-04-08T01:40:16Z

Hi @stephencox-ict, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 2 open PRs.
AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

ggerganov · 2026-04-08T07:59:35Z

> Q8_0 | ❌ Repetition on long audio

This makes me think that you have some other issue in your audio implementation - Q8_0 should always work.

stephencox-ict · 2026-04-08T11:09:05Z

> Q8_0 | ❌ Repetition on long audio
>
> This makes me think that you have some other issue in your audio implementation - Q8_0 should always work.

Possible, I ran so many tests and have lost track of which is which. Busy running them again.

stephencox-ict · 2026-04-09T00:15:36Z

Updated test results with upstream fixes

After testing with the fixes from #21421 (BF16-rounded scales, ggml_cont for sigmoid, conv norm swap) and #21625 (per-layer embedding scaling), the quantization picture has changed significantly:

With #21421 + #21625 applied (test-2.mp3, ~17s, --temp 1.0 --top-k 64 --top-p 0.95):

Quant	CPU Result
BF16	✅ Pass
Q8_0	✅ Pass
Q6_K	✅ Pass
Q5_K_M	✅ Pass
Q4_K_M	❌ Repetition

Q8_0 and Q5_K_M now produce correct transcriptions without the forced Q6_K embedding fix from this PR. The root causes of the earlier failures were:

Non-contiguous sigmoid input causing 25 graph splits on GPU backends (fixed by ggml_cont in #21421)
Missing BF16-rounded scale factors that compound through 35 layers (fixed in #21421)
Missing per-layer embedding scaling for the multimodal path (fixed in #21625)
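
To make the second root cause more concrete, here is a minimal sketch of how a scale that is not rounded to BF16 can drift from a BF16-rounded reference over many layers. The scale value 0.1 and the purely multiplicative per-layer model are assumptions chosen for illustration; the actual scale factors and the way they enter the encoder in #21421 are not shown in this thread. NaN and infinity handling is omitted.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def bf16_round(x: float) -> float:
    """Round a float32 value to bfloat16 (round to nearest, ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Adding 0x7FFF plus the lowest kept bit implements ties-to-even.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000  # bfloat16 keeps only the upper 16 bits of float32
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def compounded_ratio(scale: float, layers: int = 35) -> float:
    """Ratio between a BF16-rounded and an unrounded scale applied `layers` times."""
    per_layer = bf16_round(scale) / f32(scale)
    return per_layer ** layers


if __name__ == "__main__":
    s = 0.1  # hypothetical scale, not taken from the model
    print("float32 scale :", f32(s))
    print("bfloat16 scale:", bf16_round(s))
    print("ratio after 35 layers:", compounded_ratio(s))
```

Under these assumptions a per-layer mismatch of roughly 0.1% grows to a few percent after 35 layers, which is the kind of compounding the list item refers to.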

This PR's forced Q6_K embedding approach may still provide an additional safety margin, but it's no longer required for Q8_0 or Q5_K_M. The minimum viable quantization for audio is now Q5_K_M (with #21421 + #21625 merged). Q4_K_M still fails consistently.

Recommend updating this PR's scope to only add the warning for Q4_K_M and below, rather than forcing Q6_K on all quants.

Gemma4 audio transcription produces repetitive output on longer audio (17s+) when quantized to Q4_K_M or below. Q5_K_M and above produce correct transcriptions when combined with the audio encoder fixes from PR ggml-org#21421 and per-layer scaling from PR ggml-org#21625.

Add a warning when quantizing Gemma4 below Q5_K_M to inform users that audio quality may be degraded.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

stephencox-ict · 2026-04-10T09:24:28Z

Closing — the comprehensive validation from #21421 shows Q4_K_M and below work correctly for audio transcription:

E2B short audio: 14/14 PASS (all quants including Q3_K_S)
E2B Q4_0 long audio: PASS
E2B IQ4_NL long audio: PASS
E4B Q4_K_M: PASS (both short and long)
E4B Q3_K_S: PASS (both short and long)

The Q4_K_M "repetition" on long audio reported here was caused by the model's thinking block consuming all tokens before outputting the transcription (TRUNC), not by quantization-induced transcription failure. With higher -n values, Q4_K_M transcribes correctly.
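
The distinction drawn above, between a run that is truncated by the token budget and one that genuinely repeats, can be sketched as a small classifier. This is not the validation harness from #21421: the function names, the `EMPTY` label, and the n-gram thresholds are assumptions, and extracting the answer text that follows the thinking block is left to the caller.

```python
def has_repetition(text: str, n: int = 4, min_repeats: int = 3) -> bool:
    """True if some n-word sequence occurs min_repeats times back to back."""
    words = text.split()
    for start in range(len(words) - n * min_repeats + 1):
        gram = words[start:start + n]
        if all(
            words[start + k * n:start + (k + 1) * n] == gram
            for k in range(1, min_repeats)
        ):
            return True
    return False


def classify(n_generated: int, n_predict: int, answer: str) -> str:
    """Label a run as TRUNC, EMPTY, REPETITION, or PASS.

    n_generated: tokens actually produced
    n_predict:   token budget (the value passed via -n)
    answer:      text after the thinking block, "" if none was produced
    """
    if not answer.strip():
        # Budget exhausted before any transcription appeared.
        return "TRUNC" if n_generated >= n_predict else "EMPTY"
    if has_repetition(answer):
        return "REPETITION"
    return "PASS"
```

A run labelled `TRUNC` would be retried with a larger `-n` before drawing any conclusion about the quantization type.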

The warning is no longer needed.

stephencox-ict requested a review from ggerganov as a code owner April 8, 2026 01:36

stephencox-ict mentioned this pull request Apr 8, 2026

mtmd: add Gemma 4 audio conformer encoder support #21421

Merged

10 tasks

stephencox-ict marked this pull request as draft April 8, 2026 11:09

stephencox-ict force-pushed the gemma4-quant-fix branch from 6a5edd4 to fcad6c8 Compare April 9, 2026 00:20

stephencox-ict changed the title ~~quant: force Q6_K minimum for Gemma4 tied embeddings~~ quant: warn when quantizing Gemma4 below Q5_K_M Apr 9, 2026

stephencox-ict changed the title ~~quant: warn when quantizing Gemma4 below Q5_K_M~~ quant: warn when quantizing Gemma 4 below Q5_K_M for audio Apr 9, 2026

stephencox-ict marked this pull request as ready for review April 9, 2026 00:24

stephencox-ict closed this Apr 10, 2026

stephencox-ict deleted the gemma4-quant-fix branch April 10, 2026 09:25

stephencox-ict wants to merge 1 commit into ggml-org:master from stephencox-ict:gemma4-quant-fix
