Eval bug: Gemma4 E2B does not produce correct transcripts from audio #21820

### Name and Version

version: 8767 (aa4695c5e)
built with GNU 9.4.0 for Linux x86_64

### Operating systems

Linux

### GGML backends

CPU

### Hardware

Ryzen 1600

### Models

https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF

### Problem description & steps to reproduce

#21421 was recently merged, but the model does not seem to faithfully transcribe the speech in audio files.

I **can** consistently reproduce the (correct) results from the [PR description](https://github.com/ggml-org/llama.cpp/pull/21421#issue-4204427245), using the Librispeech test file.

```
llama-mtmd-cli -m gemma-4-E2B-it-Q8_0.gguf --mmproj mmproj-BF16.gguf --audio librispeech_test.wav -p "Transcribe this audio exactly." --temp 1.0 --top-k 64 --top-p 0.95 --jinja

...<OMITTED>...

<|channel>thought
Thinking Process:

1.  **Analyze the Request:** The user wants me to transcribe the provided audio exactly.
2.  **Analyze the Input (Audio/Text):** "Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel."
3.  **Examine the Provided Transcription:** "mr quilter is the apostle of the middle classes and we are glad to welcome his gospel."
4.  **Compare and Refine (Focus on exactness, including punctuation/style if applicable, but primarily focusing on spoken word):**
    *   The input text provided is: "Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel."
    *   The user's transcription (which I am asked to reproduce): "mr quilter is the apostle of the middle classes and we are glad to welcome his gospel"
    *   *Self-Correction/Verification:* The input provided in the prompt is the *target* audio/text. I need to ensure the transcription is a faithful reproduction of that sound/text.
5.  **Final Output Generation:** Transcribe the text provided in the prompt. (The capitalization and punctuation in the original prompt were slightly inconsistent, so I will transcribe it as spoken, aiming for clarity.)<channel|>Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.


llama_perf_context_print:        load time =    1952.20 ms
llama_perf_context_print: prompt eval time =    1119.90 ms /   171 tokens (    6.55 ms per token,   152.69 tokens per second)
llama_perf_context_print:        eval time =   29270.88 ms /   296 runs   (   98.89 ms per token,    10.11 tokens per second)
llama_perf_context_print:       total time =   35889.83 ms /   467 tokens
llama_perf_context_print:    graphs reused =        294

```
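When comparing runs like the one above, only the text after the closing `<channel|>` marker is the actual transcript; the thinking block and the `llama_perf_context_print` lines are noise. A minimal sketch of a helper that isolates it, assuming the output layout shown in this issue (the function name `final_answer` is hypothetical and not part of llama.cpp):

```python
def final_answer(output: str) -> str:
    """Return the text after the last '<channel|>' marker.

    Assumes the layout seen in this issue: a thinking block that ends
    with '<channel|>', then the answer, then perf lines.
    If the marker is absent, the whole output is treated as the answer.
    """
    marker = "<channel|>"
    tail = output.rpartition(marker)[2]
    # Drop the timing lines that llama-mtmd-cli prints after the answer.
    kept = [
        line
        for line in tail.splitlines()
        if not line.startswith("llama_perf_context_print:")
    ]
    return "\n".join(kept).strip()
```

This makes it easier to diff transcripts across runs without the sampled thinking text getting in the way.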

However, when testing with `tools/mtmd/test-2.mp3`, the transcription is usually something close to:

```
"This is the moment when humanity's oldest dream became reality.
 This is the New York Times's boldest headline across America's newspaper record.
 The New York Times has documented our nation's most pivotal moments, but rarely
 has any story matched the cosmic significance of..."
```
This does not match what is actually spoken in `tools/mtmd/test-2.mp3` (as mentioned [here](https://github.com/ggml-org/llama.cpp/pull/21421#issuecomment-4186879235)).

Further testing with another well-known test [audio file](https://github.com/ggml-org/whisper.cpp/blob/master/samples/jfk.mp3) also produces a clearly incorrect transcript:

```
llama-mtmd-cli -m gemma-4-E2B-it-Q8_0.gguf --mmproj mmproj-BF16.gguf --audio jfk.mp3 -p "Transcribe this audio exactly." --temp 1.0 --top-k 64 --top-p 0.95 --jinja

...<OMITTED>...

<|channel>thought
Thinking Process:

1.  **Analyze the Request:** The user wants an exact transcription of the provided audio.
2.  **Analyze the Audio (Source Text):** "ask fellow americans ask what my fellow americans ask what your country can do for you ask what you can do for your country" (The transcription provided in the prompt seems to be slightly messy/repetitive, but the goal is to capture the spoken words accurately.)
3.  **Listen and Segment (Simulated Listening/Verification):**
    *   "ask fellow americans"
    *   "ask what my fellow americans ask" (This segment is slightly unclear, maybe "ask what my fellow americans ask" or similar, but let's follow the flow.)
    *   "what your country can do for you"
    *   "ask what you can do for your country"
4.  **Refine Transcription (Focusing on natural speech flow and punctuation implied by the pauses/intonation):**
    *   The text is a motivational/rhetorical speech fragment about mutual responsibility.
5.  **Final Transcription Generation:** (Ensuring capitalization and phrasing match the spoken delivery as closely as possible.)

    *Input Text Check:* "ask fellow americans ask what my fellow americans ask what your country can do for you ask what you can do for your country"

6.  **Final Output Construction.** (The transcription must be exact.)<channel|>ask fellow americans ask what my fellow americans ask what your country can do for you ask what you can do for your country

```

The ground truth for this file should be: "And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country."
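To put a number on "clearly incorrect", the mismatch can be expressed as a word error rate (WER). The following is a minimal sketch, not an established evaluation script: the normalization (lowercasing, stripping punctuation) and the function names are assumptions chosen for illustration.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split into words."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed prefix of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


GROUND_TRUTH = (
    "And so my fellow Americans, ask not what your country can do for you, "
    "ask what you can do for your country."
)
MODEL_OUTPUT = (
    "ask fellow americans ask what my fellow americans ask what your country "
    "can do for you ask what you can do for your country"
)

if __name__ == "__main__":
    print(f"WER for jfk.mp3: {word_error_rate(GROUND_TRUTH, MODEL_OUTPUT):.2f}")
```

With this normalization, the Librispeech result reproduced earlier scores 0.0 against its expected sentence, while the `jfk.mp3` output scores above zero.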

Using the [recommended prompt](https://huggingface.co/google/gemma-4-E4B-it#6-audio) from Google does not improve the transcript; it produces output similar to the above for `jfk.mp3`.

Other relevant information:

- The [F32 version of mmproj](https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/resolve/main/mmproj-F32.gguf?download=true) does not improve accuracy
- Lower temperatures do not help either
- The latest .gguf files (from Unsloth) were downloaded today (2026/04/12) for testing, so recent prompt template fixes are included
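For anyone who wants to re-run the combinations listed above, here is a rough sketch that generates the corresponding `llama-mtmd-cli` command lines. The flags are the ones used in this report; the local file names and the specific lower temperature values (0.5 and 0.1) are assumptions, since the exact values tried are not recorded here.

```python
import itertools
import shlex

PROMPT = "Transcribe this audio exactly."


def build_command(model: str, mmproj: str, audio: str, temp: float) -> list[str]:
    """Build one llama-mtmd-cli invocation using the flags from this report."""
    return [
        "llama-mtmd-cli",
        "-m", model,
        "--mmproj", mmproj,
        "--audio", audio,
        "-p", PROMPT,
        "--temp", str(temp),
        "--top-k", "64",
        "--top-p", "0.95",
        "--jinja",
    ]


def sweep(model, mmprojs, audio_files, temps) -> list[list[str]]:
    """Return one command per (mmproj, audio file, temperature) combination."""
    return [
        build_command(model, mmproj, audio, temp)
        for mmproj, audio, temp in itertools.product(mmprojs, audio_files, temps)
    ]


if __name__ == "__main__":
    commands = sweep(
        model="gemma-4-E2B-it-Q8_0.gguf",
        mmprojs=["mmproj-BF16.gguf", "mmproj-F32.gguf"],
        # Assumed local copies of the three audio files discussed above.
        audio_files=["librispeech_test.wav", "test-2.mp3", "jfk.mp3"],
        # 1.0 is the value used above; 0.5 and 0.1 are illustrative guesses.
        temps=[1.0, 0.5, 0.1],
    )
    for command in commands:
        print(shlex.join(command))
```

The script only prints the commands so they can be inspected or piped into a shell; it does not run the model itself.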

### First Bad Commit

_No response_

### Relevant log output

_No response_
