Eval bug: Gemma4 E2B does not produce correct transcripts from audio #21820

@dscripka

Name and Version

version: 8767 (aa4695c)
built with GNU 9.4.0 for Linux x86_64

Operating systems

Linux

GGML backends

CPU

Hardware

Ryzen 1600

Models

https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF

Problem description & steps to reproduce

#21421 was recently merged, but the model does not seem to faithfully transcribe the speech in audio files.

I can consistently reproduce the (correct) results from the PR description, using the Librispeech test file.

llama-mtmd-cli -m gemma-4-E2B-it-Q8_0.gguf --mmproj mmproj-BF16.gguf --audio librispeech_test.wav -p "Transcribe this audio exactly." --temp 1.0 --top-k 64 --top-p 0.95 --jinja

...<OMITTED>...

<|channel>thought
Thinking Process:

1.  **Analyze the Request:** The user wants me to transcribe the provided audio exactly.
2.  **Analyze the Input (Audio/Text):** "Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel."
3.  **Examine the Provided Transcription:** "mr quilter is the apostle of the middle classes and we are glad to welcome his gospel."
4.  **Compare and Refine (Focus on exactness, including punctuation/style if applicable, but primarily focusing on spoken word):**
    *   The input text provided is: "Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel."
    *   The user's transcription (which I am asked to reproduce): "mr quilter is the apostle of the middle classes and we are glad to welcome his gospel"
    *   *Self-Correction/Verification:* The input provided in the prompt is the *target* audio/text. I need to ensure the transcription is a faithful reproduction of that sound/text.
5.  **Final Output Generation:** Transcribe the text provided in the prompt. (The capitalization and punctuation in the original prompt were slightly inconsistent, so I will transcribe it as spoken, aiming for clarity.)<channel|>Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.


llama_perf_context_print:        load time =    1952.20 ms
llama_perf_context_print: prompt eval time =    1119.90 ms /   171 tokens (    6.55 ms per token,   152.69 tokens per second)
llama_perf_context_print:        eval time =   29270.88 ms /   296 runs   (   98.89 ms per token,    10.11 tokens per second)
llama_perf_context_print:       total time =   35889.83 ms /   467 tokens
llama_perf_context_print:    graphs reused =        294
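When comparing runs like the one above, it can help to separate the final answer from the reasoning trace. The following is a minimal Python sketch, assuming the output always uses the `<|channel>thought ... <channel|>` delimiters and the `llama_perf` line prefix seen in this log; the helper name `final_transcript` is hypothetical and not part of llama.cpp.

```python
THOUGHT_END = "<channel|>"  # closing delimiter as it appears in the log above


def final_transcript(raw_output: str) -> str:
    """Return the text after the last thought-channel close marker.

    If the marker is absent, the whole output is used. Lines starting with
    "llama_perf" (timing statistics) are dropped, and whitespace is collapsed.
    """
    # rpartition returns ("", "", raw_output) when the marker is not found.
    _, _, tail = raw_output.rpartition(THOUGHT_END)
    kept = [line for line in tail.splitlines() if not line.startswith("llama_perf")]
    return " ".join(" ".join(kept).split())
```

Applied to the Librispeech run, this would leave only the sentence that follows `<channel|>`, which is the part that should be compared against the reference transcript.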

However, with tools/mtmd/test-2.mp3 the transcription is usually something close to:

"This is the moment when humanity's oldest dream became reality.
 This is the New York Times's boldest headline across America's newspaper record.
 The New York Times has documented our nation's most pivotal moments, but rarely
 has any story matched the cosmic significance of..."

This does not match what is actually spoken in tools/mtmd/test-2.mp3 (as mentioned here).

Further testing with another well-known test audio file also produces a clearly incorrect transcript:

llama-mtmd-cli -m gemma-4-E2B-it-Q8_0.gguf --mmproj mmproj-BF16.gguf --audio jfk.mp3 -p "Transcribe this audio exactly." --temp 1.0 --top-k 64 --top-p 0.95 --jinja

...<OMITTED>...

<|channel>thought
Thinking Process:

1.  **Analyze the Request:** The user wants an exact transcription of the provided audio.
2.  **Analyze the Audio (Source Text):** "ask fellow americans ask what my fellow americans ask what your country can do for you ask what you can do for your country" (The transcription provided in the prompt seems to be slightly messy/repetitive, but the goal is to capture the spoken words accurately.)
3.  **Listen and Segment (Simulated Listening/Verification):**
    *   "ask fellow americans"
    *   "ask what my fellow americans ask" (This segment is slightly unclear, maybe "ask what my fellow americans ask" or similar, but let's follow the flow.)
    *   "what your country can do for you"
    *   "ask what you can do for your country"
4.  **Refine Transcription (Focusing on natural speech flow and punctuation implied by the pauses/intonation):**
    *   The text is a motivational/rhetorical speech fragment about mutual responsibility.
5.  **Final Transcription Generation:** (Ensuring capitalization and phrasing match the spoken delivery as closely as possible.)

    *Input Text Check:* "ask fellow americans ask what my fellow americans ask what your country can do for you ask what you can do for your country"

6.  **Final Output Construction.** (The transcription must be exact.)<channel|>ask fellow americans ask what my fellow americans ask what your country can do for you ask what you can do for your country

Ground truth for this file should be: "And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country."
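One way to put a number on the mismatch is a word error rate (WER) between the ground truth and the model output. The sketch below is a plain word-level Levenshtein distance, assuming a simple normalization (lowercasing and stripping punctuation); the function names are hypothetical, and established ASR evaluation tools may normalize differently.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split into words."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                          # reference word missing from hypothesis
                curr[j - 1] + 1,                      # extra word in hypothesis
                prev[j - 1] + (ref_word != hyp_word)  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = (
    "And so my fellow Americans, ask not what your country can do for you, "
    "ask what you can do for your country."
)
hypothesis = (
    "ask fellow americans ask what my fellow americans ask what your country "
    "can do for you ask what you can do for your country"
)
print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
```

Because the normalization ignores case and punctuation, the Librispeech output above would score as an exact match, while the jfk.mp3 output would not.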

Using the recommended prompt from Google does not improve the transcript; it produces output similar to the above for jfk.mp3.

Other relevant information:

  • The F32 version of mmproj does not improve accuracy
  • Lower temperatures do not help either
  • The latest .gguf files (from Unsloth) were downloaded today (2026/04/12) for testing, so recent prompt template fixes are included
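For anyone who wants to repeat the checks listed above, the combinations can be enumerated with a short Python sketch. It only uses the `llama-mtmd-cli` flags already shown in this report. The F32 projector file name (`mmproj-F32.gguf`) and the temperature grid are assumptions for illustration, not the exact values tested.

```python
import itertools
import shlex

PROMPT = "Transcribe this audio exactly."


def build_command(mmproj: str, audio: str, temp: float) -> list[str]:
    """Build the llama-mtmd-cli argument list used in this report."""
    return [
        "llama-mtmd-cli",
        "-m", "gemma-4-E2B-it-Q8_0.gguf",
        "--mmproj", mmproj,
        "--audio", audio,
        "-p", PROMPT,
        "--temp", str(temp),
        "--top-k", "64",
        "--top-p", "0.95",
        "--jinja",
    ]


# Assumed values: the F32 file name and the temperature grid are illustrative.
MMPROJ_FILES = ["mmproj-BF16.gguf", "mmproj-F32.gguf"]
AUDIO_FILES = ["librispeech_test.wav", "tools/mtmd/test-2.mp3", "jfk.mp3"]
TEMPERATURES = [0.2, 0.6, 1.0]

# Print each command instead of running it, so the list can be reviewed first.
for mmproj, audio, temp in itertools.product(MMPROJ_FILES, AUDIO_FILES, TEMPERATURES):
    print(shlex.join(build_command(mmproj, audio, temp)))
```

The printed lines can be pasted into a shell, or each argument list can be passed to `subprocess.run` to execute the sweep directly.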

First Bad Commit

No response

Relevant log output

Logs
