
llama: Wire up Qwen3.5/Qwen3.5MoE tensors for NVFP4 support#20506

Merged
CISC merged 2 commits into ggml-org:master from michaelw9999:nvfp4-qwen35-nvfp4-loader-fix
Mar 14, 2026

Conversation

@michaelw9999
Contributor

PR #20505 fixes the conversion errors when creating Qwen3.5 NVFP4 GGUF files and properly reorders the Qwen3.5 linear-attention layers, but without this update those models will not load.

This update wires up the Qwen3.5 tensors so they are properly loaded from Qwen3.5 NVFP4 GGUF files, following the existing design intent of routing projections through build_lora_mm.

It links up:

  • the recurrent / linear-attention tensors
  • the FFN tensors that loaded but did not apply their scales
  • the MoE shared-expert FFN scales

@michaelw9999 michaelw9999 requested a review from CISC as a code owner March 13, 2026 12:12
Copilot AI review requested due to automatic review settings March 13, 2026 12:12
@github-actions github-actions bot added the model Model specific label Mar 13, 2026

Copilot AI left a comment


Pull request overview

This PR wires Qwen3.5 and Qwen3.5MoE tensor scale metadata into the model build path so NVFP4 GGUFs load correctly (including linear-attention/recurrent and MoE/shared-expert FFN scale handling).

Changes:

  • Pass per-tensor scale tensors into build_lora_mm for attention and linear-attention (SSM) projections.
  • Pass per-tensor and per-expert scale tensors into FFN / MoE FFN builders (including shared experts).
  • Extend llama_layer and tensor loading to create optional scale tensors for the newly wired weights.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

  • src/models/qwen35moe.cpp — Threads newly loaded scale tensors through Qwen3.5MoE attention, linear-attention, and MoE/shared-expert FFN paths.
  • src/models/qwen35.cpp — Threads newly loaded scale tensors through Qwen3.5 attention, linear-attention, and dense FFN paths.
  • src/llama-model.h — Adds layer members to store the new scale tensors (mixed QKV, gate, SSM, shared-expert FFN scales).
  • src/llama-model.cpp — Creates optional scale tensors for the new layer scale members during tensor loading.


@michaelw9999 michaelw9999 changed the title from "ggml: Wire up Qwen3.5/Qwen3.5MoE tensors for NVFP4 support" to "llama: Wire up Qwen3.5/Qwen3.5MoE tensors for NVFP4 support" Mar 13, 2026
@CISC CISC merged commit d23355a into ggml-org:master Mar 14, 2026
6 of 80 checks passed
@michaelw9999 michaelw9999 deleted the nvfp4-qwen35-nvfp4-loader-fix branch March 14, 2026 21:57

Labels

model Model specific


3 participants