
Fix granite speech model inference by applying embedding scale when deepstack is not used #24357

Merged
ngxson merged 3 commits into ggml-org:master from arnu515:master on Jun 9, 2026

@arnu515

@arnu515 arnu515 commented Jun 9, 2026

Contributor

Overview

Granite speech inference stopped working as a result of #23545 (found via git bisect). It would just output a bunch of asterisks indefinitely. The culprit was an if statement in llama-graph.cpp that didn't scale raw embeddings, which was correct for granite vision (since it has deepstack layers), but not for granite speech.

This commit fixes that by adding a guard for deepstack layers to that if statement. This fixes granite speech without affecting granite vision.
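The guard described above can be sketched in isolation. This is a minimal illustration, not the code in llama-graph.cpp: the `HParams` struct, the `should_scale` and `scale_inputs` helpers, and the plain loop standing in for the graph operation are all hypothetical. Only the field names `f_embedding_scale` and `n_deepstack_layers` and the shape of the condition come from this PR.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the two hyperparameters involved. The field
// names mirror the PR, but this is not the llama.cpp struct.
struct HParams {
    float    f_embedding_scale;   // 0.0f means no scaling is configured
    uint32_t n_deepstack_layers;  // 0 means the model has no deepstack layers
};

// Token inputs are always scaled. Raw (multimodal) embeddings are scaled
// only when the model has no deepstack layers.
bool should_scale(const HParams & hp, bool is_token_input) {
    return hp.f_embedding_scale != 0.0f &&
           (is_token_input || hp.n_deepstack_layers == 0);
}

// Multiply the input embeddings by the scale in place when the guard holds.
// A plain loop is used here instead of a graph op (assumption for brevity).
void scale_inputs(std::vector<float> & embd, const HParams & hp, bool is_token_input) {
    if (!should_scale(hp, is_token_input)) {
        return;
    }
    for (float & x : embd) {
        x *= hp.f_embedding_scale;
    }
}
```

With an arbitrary scale of 12.0, raw embeddings passed with `n_deepstack_layers == 0` (the granite speech case) are multiplied by 12.0, while the same embeddings with a non-zero `n_deepstack_layers` (the granite vision case) are left untouched.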

Requirements

@arnu515 arnu515 requested a review from CISC as a code owner June 9, 2026 11:34
@ngxson

ngxson commented Jun 9, 2026

Collaborator

@gabe-l-hart could you check if this is the root cause of the problem related to mtmd test?

@gabe-l-hart gabe-l-hart left a comment

Collaborator

Thanks for catching this! This code path should probably be refactored into the individual model graph builders so we don't run into this kind of cross-cutting breakage. It was originally added for granite and has been reused by all derived granite models, but the hparams.f_embedding_scale param is not model-specific, so other models are likely using it too. That makes it hard to tell which of them would break if the scaling moved into the model builders.

One small suggestion on the comment to make it more generic since this still lives in the generic code path, but not a strong opinion, so no need to block merging.

@ngxson Confirmed that this is the root cause of the failing test for granite-vision-3.2. I'll go ahead and close #24323

Comment thread src/llama-graph.cpp Outdated
Comment on lines -1876 to -1878
// For Granite architecture
// NOTE: Only apply scale to token inputs. Raw embeddings are assumed to be
// multimodal inputs that should not be scaled.
if (ubatch.token && hparams.f_embedding_scale != 0.0f) {

Collaborator

Suggested change
// NOTE: For deepstack models, only apply scale to token inputs (ie text-only input).
// Raw embeddings are assumed to be multimodal inputs that should not be scaled.
if (hparams.f_embedding_scale != 0.0f && (ubatch.token || hparams.n_deepstack_layers == 0)) {
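To see which inputs the suggested condition affects, the old and new checks can be compared side by side. This is a small sketch with hypothetical helper names (`scaled_before`, `scaled_after`, `behavior_changed`); the two boolean expressions are copied from the lines quoted in this thread, with `ubatch.token` reduced to a plain `bool`.

```cpp
#include <cstdint>

// Check as quoted in the removed lines: only token inputs are scaled.
bool scaled_before(bool is_token, float scale, uint32_t n_deepstack_layers) {
    (void) n_deepstack_layers;  // the old check never consulted this value
    return is_token && scale != 0.0f;
}

// Check from the suggested change: raw embeddings are also scaled when the
// model has no deepstack layers.
bool scaled_after(bool is_token, float scale, uint32_t n_deepstack_layers) {
    return scale != 0.0f && (is_token || n_deepstack_layers == 0);
}

// True when the two checks disagree for the given input.
bool behavior_changed(bool is_token, float scale, uint32_t n_deepstack_layers) {
    return scaled_before(is_token, scale, n_deepstack_layers) !=
           scaled_after(is_token, scale, n_deepstack_layers);
}
```

Under these definitions the two checks differ in exactly one situation: raw embedding input, a non-zero scale, and zero deepstack layers. Token inputs and deepstack models behave as before.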

@gabe-l-hart

Collaborator

I'm going to do a quick sanity check on 4.1 vision to make sure this doesn't have issues there, but it shouldn't.

@gabe-l-hart

Collaborator

All good with 4.1 vision too

$ ./build/bin/llama-mtmd-cli -m ~/models/ibm-granite/granite-vision-4.1-4b/granite-4B-vision-4.1-BF16.gguf --mmproj ~/models/ibm-granite/granite-vision-4.1-4b/mmproj-granite-vision-4b-4.1-BF16.gguf --image ./tools/mtmd/test-1.jpeg --temp 0 -n 128 --flash-attn on -p 'what is the publisher name of the newspaper?' 
0.00.038.624 I common_init_result: fitting params to device memory ...
0.00.038.628 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.02.048.601 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.02.135.834 I mtmd_cli_context: chat template example:
<|start_of_role|>system<|end_of_role|>You are a helpful assistant<|end_of_text|>
<|start_of_role|>user<|end_of_role|>Hello<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>Hi there<|end_of_text|>
<|start_of_role|>user<|end_of_role|>How are you?<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>
0.11.508.194 I main: loading model: /Users/ghart/models/ibm-granite/granite-vision-4.1-4b/granite-4B-vision-4.1-BF16.gguf
0.11.508.201 W WARN: This is an experimental CLI for testing multimodal capability.
0.11.508.201 W       For normal use cases, please use the standard llama-cli

new york times


0.13.858.244 W ~llama_context:       MTL0 compute buffer size of 381.5886 MiB, does not match expectation of 226.0117 MiB

Also confirmed working with granite-speech-4.1-2b:

Without fix

$ ./build/bin/llama-mtmd-cli -m ~/models/ibm-granite/granite-speech-4.1-2b-GGUF/granite-speech-4.1-2b-Q4_K_M.gguf --mmproj ~/models/ibm-granite/granite-speech-4.1-2b-GGUF/mmproj-model-f16.gguf  --temp 0 -n 128 --flash-attn on --audio ~/models/ibm-granite/granite-speech-4.1-2b/multilingual_sample.wav -p "transcribe this audio to a text format" --jinja
0.00.039.914 I common_init_result: fitting params to device memory ...
0.00.039.919 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.01.082.039 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.01.113.404 I mtmd_cli_context: chat template example:
USER: You are a helpful assistant
Hello
 ASSISTANT:Hi thereUSER: How are you?
 ASSISTANT:
0.10.994.531 W init_audio: audio input is in experimental stage and may have reduced quality:
    https://github.com/ggml-org/llama.cpp/discussions/13759
0.10.994.776 I main: loading model: /Users/ghart/models/ibm-granite/granite-speech-4.1-2b-GGUF/granite-speech-4.1-2b-Q4_K_M.gguf
0.10.994.780 W WARN: This is an experimental CLI for testing multimodal capability.
0.10.994.780 W       For normal use cases, please use the standard llama-cli

総合的研究成果は、国際的な研究成果を発表するために必要です。

With fix

$ ./build/bin/llama-mtmd-cli -m ~/models/ibm-granite/granite-speech-4.1-2b-GGUF/granite-speech-4.1-2b-Q4_K_M.gguf --mmproj ~/models/ibm-granite/granite-speech-4.1-2b-GGUF/mmproj-model-f16.gguf  --temp 0 -n 128 --flash-attn on --audio ~/models/ibm-granite/granite-speech-4.1-2b/multilingual_sample.wav -p "transcribe this audio to a text format" --jinja
0.00.033.746 I common_init_result: fitting params to device memory ...
0.00.033.748 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.01.091.898 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.01.124.713 I mtmd_cli_context: chat template example:
USER: You are a helpful assistant
Hello
 ASSISTANT:Hi thereUSER: How are you?
 ASSISTANT:
0.11.259.268 W init_audio: audio input is in experimental stage and may have reduced quality:
    https://github.com/ggml-org/llama.cpp/discussions/13759
0.11.259.528 I main: loading model: /Users/ghart/models/ibm-granite/granite-speech-4.1-2b-GGUF/granite-speech-4.1-2b-Q4_K_M.gguf
0.11.259.535 W WARN: This is an experimental CLI for testing multimodal capability.
0.11.259.535 W       For normal use cases, please use the standard llama-cli

for timothy was a spoiled cat, and he allowed no one to interfere. everybody waited upon him, moving their chairs even, for he was monarch of the hearth. "dinarzade, la nuit suivante appela sa soeur quand il en fut temps. 'si vous ne dormez pas, ma soeur,' lui dit elle, 'je vous prie en attendant le jour qui paraîtra bientôt de continuer le conte du pêcheur.

@ngxson

ngxson commented Jun 9, 2026

Collaborator

Thanks! Confirmed that all tests pass now:

(I will need to push a commit to remove the non-existent Hunyuan-VL model from the tests)

[vision] OK:   ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0
[vision] OK:   ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0
[vision] OK:   ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
[vision] OK:   THUDM/glm-edge-v-5b-gguf:Q4_K_M
[vision] OK:   second-state/Llava-v1.5-7B-GGUF:Q2_K
[vision] OK:   cjpais/llava-1.6-mistral-7b-gguf:Q3_K_M
[vision] OK:   ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
[vision] OK:   second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
[vision] OK:   openbmb/MiniCPM-V-2_6-gguf:Q2_K
[vision] OK:   openbmb/MiniCPM-o-2_6-gguf:Q4_0
[vision] OK:   bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/InternVL2_5-1B-GGUF:Q8_0
[vision] OK:   ggml-org/InternVL3-1B-Instruct-GGUF:Q8_0
[vision] OK:   ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[vision] OK:   ggml-org/LFM2-VL-450M-GGUF:Q8_0
[vision] OK:   ggml-org/granite-docling-258M-GGUF:Q8_0
[vision] OK:   ggml-org/LightOnOCR-1B-1025-GGUF:Q8_0
[vision] OK:   ggml-org/DeepSeek-OCR-GGUF:Q8_0
[vision] OK:   ggml-org/dots.ocr-GGUF:Q8_0
[vision] OK:   ggml-org/HunyuanOCR-GGUF:Q8_0
[vision] OK:   ggml-org/gemma-4-E2B-it-GGUF:Q8_0
[audio]  OK:   ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF:Q8_0
[audio]  OK:   ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[audio]  OK:   ggml-org/Voxtral-Mini-3B-2507-GGUF:Q4_K_M
[audio]  OK:   ggml-org/LFM2-Audio-1.5B-GGUF:Q8_0
[audio]  OK:   ggml-org/gemma-4-E2B-it-GGUF:Q8_0
[audio]  OK:   ggml-org/Qwen3-ASR-0.6B-GGUF:Q8_0

@ngxson ngxson requested a review from a team as a code owner June 9, 2026 15:54
@ngxson ngxson requested a review from CISC June 9, 2026 15:55
@ngxson

ngxson commented Jun 9, 2026

Collaborator

Need an approval, @ggml-org/maintainers. Thanks!

@ngxson ngxson merged commit d73cd07 into ggml-org:master Jun 9, 2026
25 checks passed
Jcfunk added a commit to Jcfunk/llama.cpp that referenced this pull request Jun 11, 2026
5 participants