Fix granite speech model inference by applying embedding scale when deepstack is not used by arnu515 · Pull Request #24357 · ggml-org/llama.cpp

arnu515 · 2026-06-09T11:34:21Z

Overview

Granite speech inference stopped working as a result of #23545 (found via git bisect). It would just output a bunch of asterisks indefinitely. The culprit was an if statement in llama-graph.cpp that didn't scale raw embeddings, which was correct for granite vision (since it has deepstack layers), but not for granite speech.

This commit fixes that by adding a guard for deepstack layers to that if statement. This fixes granite speech without affecting granite vision.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: No AI used

ngxson · 2026-06-09T11:50:09Z

@gabe-l-hart could you check if this is the root cause of the problem related to mtmd test?

gabe-l-hart

Thanks for catching this! This code path should probably be refactored into individual model graph builders so we don't run into this kind of cross-cutting breakage. It was originally added for granite and has been reused by all derived granite models, but the hparams.f_embedding_scale param is not model-specific, so other models are likely using it too, which makes it hard to tell which of them would break if this moved into the model builders.

One small suggestion on the comment to make it more generic since this still lives in the generic code path, but not a strong opinion, so no need to block merging.

@ngxson Confirmed that this is the root cause of the failing test for granite-vision-3.2. I'll go ahead and close #24323

gabe-l-hart · 2026-06-09T15:24:42Z

    // For Granite architecture
-    // NOTE: Only apply scale to token inputs. Raw embeddings are assumed to be
-    //  multimodal inputs that should not be scaled.
-    if (ubatch.token && hparams.f_embedding_scale != 0.0f) {


Suggested change

+    // NOTE: For deepstack models, only apply scale to token inputs (ie text-only input).
+    //  Raw embeddings are assumed to be multimodal inputs that should not be scaled.
+    if (hparams.f_embedding_scale != 0.0f && (ubatch.token || hparams.n_deepstack_layers == 0)) {

gabe-l-hart · 2026-06-09T15:40:04Z

I'm going to do a quick sanity check on 4.1 vision to make sure this doesn't have issues there, but it shouldn't.

gabe-l-hart · 2026-06-09T15:44:32Z

All good with 4.1 vision too

$ ./build/bin/llama-mtmd-cli -m ~/models/ibm-granite/granite-vision-4.1-4b/granite-4B-vision-4.1-BF16.gguf --mmproj ~/models/ibm-granite/granite-vision-4.1-4b/mmproj-granite-vision-4b-4.1-BF16.gguf --image ./tools/mtmd/test-1.jpeg --temp 0 -n 128 --flash-attn on -p 'what is the publisher name of the newspaper?' 
0.00.038.624 I common_init_result: fitting params to device memory ...
0.00.038.628 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.02.048.601 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.02.135.834 I mtmd_cli_context: chat template example:
<|start_of_role|>system<|end_of_role|>You are a helpful assistant<|end_of_text|>
<|start_of_role|>user<|end_of_role|>Hello<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>Hi there<|end_of_text|>
<|start_of_role|>user<|end_of_role|>How are you?<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>
0.11.508.194 I main: loading model: /Users/ghart/models/ibm-granite/granite-vision-4.1-4b/granite-4B-vision-4.1-BF16.gguf
0.11.508.201 W WARN: This is an experimental CLI for testing multimodal capability.
0.11.508.201 W       For normal use cases, please use the standard llama-cli

new york times


0.13.858.244 W ~llama_context:       MTL0 compute buffer size of 381.5886 MiB, does not match expectation of 226.0117 MiB

Also confirmed working with granite-speech-4.1-2b:

Without fix

$ ./build/bin/llama-mtmd-cli -m ~/models/ibm-granite/granite-speech-4.1-2b-GGUF/granite-speech-4.1-2b-Q4_K_M.gguf --mmproj ~/models/ibm-granite/granite-speech-4.1-2b-GGUF/mmproj-model-f16.gguf  --temp 0 -n 128 --flash-attn on --audio ~/models/ibm-granite/granite-speech-4.1-2b/multilingual_sample.wav -p "transcribe this audio to a text format" --jinja
0.00.039.914 I common_init_result: fitting params to device memory ...
0.00.039.919 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.01.082.039 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.01.113.404 I mtmd_cli_context: chat template example:
USER: You are a helpful assistant
Hello
 ASSISTANT:Hi thereUSER: How are you?
 ASSISTANT:
0.10.994.531 W init_audio: audio input is in experimental stage and may have reduced quality:
    https://github.com/ggml-org/llama.cpp/discussions/13759
0.10.994.776 I main: loading model: /Users/ghart/models/ibm-granite/granite-speech-4.1-2b-GGUF/granite-speech-4.1-2b-Q4_K_M.gguf
0.10.994.780 W WARN: This is an experimental CLI for testing multimodal capability.
0.10.994.780 W       For normal use cases, please use the standard llama-cli

総合的研究成果は、国際的な研究成果を発表するために必要です。

With fix

$ ./build/bin/llama-mtmd-cli -m ~/models/ibm-granite/granite-speech-4.1-2b-GGUF/granite-speech-4.1-2b-Q4_K_M.gguf --mmproj ~/models/ibm-granite/granite-speech-4.1-2b-GGUF/mmproj-model-f16.gguf  --temp 0 -n 128 --flash-attn on --audio ~/models/ibm-granite/granite-speech-4.1-2b/multilingual_sample.wav -p "transcribe this audio to a text format" --jinja
0.00.033.746 I common_init_result: fitting params to device memory ...
0.00.033.748 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.01.091.898 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.01.124.713 I mtmd_cli_context: chat template example:
USER: You are a helpful assistant
Hello
 ASSISTANT:Hi thereUSER: How are you?
 ASSISTANT:
0.11.259.268 W init_audio: audio input is in experimental stage and may have reduced quality:
    https://github.com/ggml-org/llama.cpp/discussions/13759
0.11.259.528 I main: loading model: /Users/ghart/models/ibm-granite/granite-speech-4.1-2b-GGUF/granite-speech-4.1-2b-Q4_K_M.gguf
0.11.259.535 W WARN: This is an experimental CLI for testing multimodal capability.
0.11.259.535 W       For normal use cases, please use the standard llama-cli

for timothy was a spoiled cat, and he allowed no one to interfere. everybody waited upon him, moving their chairs even, for he was monarch of the hearth. "dinarzade, la nuit suivante appela sa soeur quand il en fut temps. 'si vous ne dormez pas, ma soeur,' lui dit elle, 'je vous prie en attendant le jour qui paraîtra bientôt de continuer le conte du pêcheur.

ngxson · 2026-06-09T15:54:02Z

Thanks! Confirmed that all tests pass now:

(I will need to push a commit to remove the non-existent Hunyuan-VL model from the tests)

[vision] OK:   ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0
[vision] OK:   ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0
[vision] OK:   ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
[vision] OK:   THUDM/glm-edge-v-5b-gguf:Q4_K_M
[vision] OK:   second-state/Llava-v1.5-7B-GGUF:Q2_K
[vision] OK:   cjpais/llava-1.6-mistral-7b-gguf:Q3_K_M
[vision] OK:   ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
[vision] OK:   second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
[vision] OK:   openbmb/MiniCPM-V-2_6-gguf:Q2_K
[vision] OK:   openbmb/MiniCPM-o-2_6-gguf:Q4_0
[vision] OK:   bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/InternVL2_5-1B-GGUF:Q8_0
[vision] OK:   ggml-org/InternVL3-1B-Instruct-GGUF:Q8_0
[vision] OK:   ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[vision] OK:   ggml-org/LFM2-VL-450M-GGUF:Q8_0
[vision] OK:   ggml-org/granite-docling-258M-GGUF:Q8_0
[vision] OK:   ggml-org/LightOnOCR-1B-1025-GGUF:Q8_0
[vision] OK:   ggml-org/DeepSeek-OCR-GGUF:Q8_0
[vision] OK:   ggml-org/dots.ocr-GGUF:Q8_0
[vision] OK:   ggml-org/HunyuanOCR-GGUF:Q8_0
[vision] OK:   ggml-org/gemma-4-E2B-it-GGUF:Q8_0
[audio]  OK:   ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF:Q8_0
[audio]  OK:   ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[audio]  OK:   ggml-org/Voxtral-Mini-3B-2507-GGUF:Q4_K_M
[audio]  OK:   ggml-org/LFM2-Audio-1.5B-GGUF:Q8_0
[audio]  OK:   ggml-org/gemma-4-E2B-it-GGUF:Q8_0
[audio]  OK:   ggml-org/Qwen3-ASR-0.6B-GGUF:Q8_0

ngxson · 2026-06-09T17:32:48Z

need an approval @ggml-org/maintainers , thanks

* upstream/HEAD: (329 commits)
  vendor : update LibreSSL to 4.3.2 (ggml-org#24397)
  Remove padding and multiple D2D copies for MTP (ggml-org#24086)
  chat: fix LFM2/LFM2.5 ignoring json_schema (ggml-org#24377)
  CUDA: Fix ssm_scan_f32 data-races (ggml-org#24360)
  ci : bump komac version (ggml-org#24396)
  speculative : fix "ngram-map-k4v" name in logging (ggml-org#24253)
  webui: implement pinned conversations support (ggml-org#21387)
  graph: Fix granite speech model inference by applying embedding scale when deepstack is not used (ggml-org#24357)
  ci : fix windows release (ggml-org#24369)
  ui: add opt-in run_javascript frontend tool (ggml-org#24244)
  mtmd: build_vit batching (ggml-org#24352)
  vulkan: reduce iq1 shared memory usage for mul_mm (ggml-org#24287)
  vulkan: add `v_dot2_f32_f16` support in matrix-matrix multiplication and Flash Attention (ggml-org#24123)
  ui: Fix excessive style recalculation on hover (ggml-org#24243)
  mtmd: refactor video subproc handling (ggml-org#24316)
  server: log prompts to directory (ggml-org#22031)
  ui: fix mobile chat form overflow and bust stale bundle cache (ggml-org#24158)
  ggml : add GGML_OP_COL2IM_1D (ggml-org#24206)
  server : do not clear slots without unified KV cache (ggml-org#24190)
  models : fix plamo2 attention_key/value_length regression (ggml-org#24317)
  ...

llama-graph : apply embedding scale when deepstack is not used

c4b22de

arnu515 requested a review from CISC as a code owner June 9, 2026 11:34

gabe-l-hart approved these changes Jun 9, 2026


gabe-l-hart mentioned this pull request Jun 9, 2026

fix: Add --jinja flag to granite-vision-3.2 test #24323

Closed

CISC approved these changes Jun 9, 2026


nits: remove non-existant hunyuan-vl from the tests

4584f94

ngxson requested a review from a team as a code owner June 9, 2026 15:54

ngxson requested a review from CISC June 9, 2026 15:55

github-actions Bot added the examples label Jun 9, 2026

apply suggestion from @gabe-l-hart

c28b938

ServeurpersoCom approved these changes Jun 9, 2026


ngxson merged commit d73cd07 into ggml-org:master Jun 9, 2026
25 checks passed

ngxson merged 3 commits into ggml-org:master from arnu515:master

5 participants
