llama: prefix MTP assistant tensors with 'mtp.' on load allowing use of -ot 'mtp..*=CUDA0' flag #7

Merged
Ooooze merged 1 commit into AtomicBot-ai:feature/turboquant-kv-cache from sujitvasanth:fix/mtp-assistant-tensor-prefix on May 12, 2026

@sujitvasanth commented on May 11, 2026

Overview

When the Gemma 4 assistant GGUF is loaded via llama_model_load_mtp_from_file, its block tensors (blk.0-3.*), token_embd, output_norm and rope_freqs share identical names with the target model's tensors. This makes it impossible to uniquely target MTP assistant tensors via -ot rules for GPU placement.

Fix: after loading the assistant from file, rename every tensor not already prefixed with 'mtp.' to 'mtp.<original_name>'. The rename is purely in-memory, applied to the tensors_by_name vector and the ggml_tensor name field; the GGUF file and published arch names are unchanged.
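
The rename step described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `mock_tensor`, `tensor_list` and `prefix_mtp_tensors` are hypothetical stand-ins, and the 64-byte name buffer is an assumed size for the tensor name field.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for ggml_tensor: only the fixed-size name field matters here.
// The buffer size of 64 is an assumption made for this sketch.
struct mock_tensor {
    char name[64];
};

// Hypothetical stand-in for the loader's tensors_by_name vector.
using tensor_list = std::vector<std::pair<std::string, mock_tensor *>>;

// Prefix every tensor name that does not already start with "mtp.".
// Both the vector key and the tensor's own name field are updated, in memory only.
static void prefix_mtp_tensors(tensor_list & tensors) {
    static const std::string prefix = "mtp.";
    for (auto & entry : tensors) {
        assert(entry.second != nullptr);
        if (entry.first.compare(0, prefix.size(), prefix) == 0) {
            continue; // already prefixed, leave untouched
        }
        entry.first = prefix + entry.first;
        // snprintf truncates instead of overflowing if the new name is too long
        std::snprintf(entry.second->name, sizeof(entry.second->name), "%s", entry.first.c_str());
    }
}
```

Calling `prefix_mtp_tensors` on a list holding `blk.0.attn_q.weight` would leave it named `mtp.blk.0.attn_q.weight`, while an entry that already starts with `mtp.` is skipped. In the real codebase the name field would more likely be written through `ggml_set_name` than through a direct `snprintf`.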

After this change, all MTP assistant tensors are addressable as mtp.blk.N.*, mtp.token_embd.weight, mtp.output_norm.weight, etc., and can be pinned to a single device with:

  -ot 'mtp\..*=CUDA0'

This speeds up inference on multi-GPU systems. It matters on dual-GPU setups in particular, because splitting the MTP head across devices slows inference down.
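
To illustrate why the prefix makes the assistant tensors individually addressable, here is a small sketch of how such a pattern separates the two sets of names. It assumes that override patterns are applied as unanchored regular-expression searches against tensor names; that is an assumption about the loader, and `pattern_selects` is a hypothetical helper, not a llama.cpp function.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns true if an override pattern would select the given tensor name,
// assuming the pattern is applied as an unanchored regex search.
static bool pattern_selects(const std::string & pattern, const std::string & tensor_name) {
    assert(!pattern.empty());
    const std::regex re(pattern);
    return std::regex_search(tensor_name, re);
}
```

With the pattern `mtp\..*`, a prefixed name such as `mtp.blk.0.attn_q.weight` is selected, while the target model's `blk.0.attn_q.weight` is not, so the rule affects only the assistant tensors.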

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes. Co-written with Claude; I manually edited and reviewed the code, then compiled and tested it. Everything works on Ubuntu 20.04 and gives a further speedup.

When the Gemma 4 assistant GGUF is loaded via llama_model_load_mtp_from_file,
its block tensors (blk.0-3.*), token_embd, output_norm and rope_freqs share
identical names with the target model's tensors. This makes it impossible to
uniquely target MTP assistant tensors via -ot rules for GPU placement.

Fix: after loading the assistant into aux, rename all tensors not already
prefixed with 'mtp.' to 'mtp.<original_name>'. This is done purely in-memory
on the tensors_by_name vector and the ggml_tensor name field — the GGUF file
and published arch names are unchanged.

After this change, all MTP assistant tensors are addressable as mtp.blk.N.*,
mtp.token_embd.weight, mtp.output_norm.weight etc, and can be pinned with:

  -ot 'mtp\..*=CUDA0'
@sujitvasanth changed the title from "llama: prefix MTP assistant tensors with 'mtp.' on load" to "llama: prefix MTP assistant tensors with 'mtp.' on load allowing useof -ot 'mtp..*=CUDA0' flag" on May 11, 2026
@sujitvasanth changed the title from "llama: prefix MTP assistant tensors with 'mtp.' on load allowing useof -ot 'mtp..*=CUDA0' flag" to "llama: prefix MTP assistant tensors with 'mtp.' on load allowing use of -ot 'mtp..*=CUDA0' flag" on May 11, 2026
@Ooooze merged commit e381dc9 into AtomicBot-ai:feature/turboquant-kv-cache on May 12, 2026
1 check passed
Ooooze added a commit that referenced this pull request on May 12, 2026
Brings in Gemma 4 + TurboQuant KV cache fixes:
- fix/turbo-rope-shift-gemma4 (PR #10)
- fix/iswa-get-can-shift-gemma4 (PR #9)
- fix/mtp-assistant-tensor-prefix (PR #7)