Gemma4_26B_A4B_NvFp4 hf checkpoint convert to gguf format fixes #22804

Merged
CISC merged 6 commits into ggml-org:master from ynankani:ynankani/gemma4_moe_nvfp4_fixes
May 8, 2026

Conversation

ynankani commented May 7, 2026

Contributor

Overview

Fixes for converting the Gemma4_26B_A4B NVFP4 HF checkpoint to GGUF format. This PR makes the following changes:

  1. Excluded weight_scale, weight_scale_2, and input_scale from the existing + ".weight" rename for .experts. tensors. The original rename corrupted NVFP4 scale tensor names (e.g. experts.0.down_proj.weight_scale_2 => experts.0.down_proj.weight_scale_2.weight), which broke the NVFP4 lookup in _generate_nvfp4_tensors.
  2. Added FFN_GATE_EXP and FFN_UP_EXP alongside the existing FFN_GATE_UP_EXP in the GEMMA4 tensor allow-list. Previously only the fused FFN_GATE_UP_EXP was allowed. HF NVFP4 checkpoints store gate/up/down as separate per-expert tensors, so the converter could not map them. The other option, fusing gate and up proj, would have required re-quantizing.
  3. Made ffn_gate_up_exps optional (TENSOR_NOT_REQUIRED) and added fallback creation of separate ffn_gate_exps and ffn_up_exps when the fused tensor is absent.
  4. Added conditional plumbing in build_moe_ffn so that it passes either the fused or the separate tensors.
  5. Pre-folds each layer's router.per_expert_scale into the corresponding expert's down_proj.weight_scale_2 at conversion time, then pop()s router.per_expert_scale from model_tensors so the existing modify_tensors mapping doesn't fire for NVFP4 conversions. That mapping was causing a "Duplicated tensor name 'blk.0.ffn_down_exps.scale'" error, because per_expert_scale and weight_scale_2 both used the same slot.
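
As an illustration of item 1, here is a minimal sketch of a suffix-guarded rename. The function name `rename_expert_tensor` and the constant `NVFP4_SCALE_SUFFIXES` are hypothetical and are not taken from convert_hf_to_gguf.py; only the `.experts.` pattern and the three scale suffixes come from the description above. Leaving names that already end in `.weight` untouched is also an assumption.

```python
# Hypothetical sketch, not the actual converter code.
NVFP4_SCALE_SUFFIXES = (".weight_scale", ".weight_scale_2", ".input_scale")


def rename_expert_tensor(name: str) -> str:
    """Append ".weight" to per-expert tensor names, leaving NVFP4 scales alone."""
    if ".experts." not in name:
        return name  # not an expert tensor: untouched
    if name.endswith(NVFP4_SCALE_SUFFIXES):
        return name  # NVFP4 scale tensor: keep the name so later lookups match
    if name.endswith(".weight"):
        return name  # assumption: already-suffixed names are left as-is
    return name + ".weight"


plain = rename_expert_tensor("layers.0.experts.0.down_proj")
scale = rename_expert_tensor("layers.0.experts.0.down_proj.weight_scale_2")
print(plain)  # layers.0.experts.0.down_proj.weight
print(scale)  # layers.0.experts.0.down_proj.weight_scale_2
```

Without the suffix guard, the second name would come out as `...weight_scale_2.weight`, which is the breakage described in item 1.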

Additional information

Tested with https://huggingface.co/nvidia/Gemma-4-26B-A4B-NVFP4
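
The scale pre-folding from item 5 of the overview can be sketched roughly as follows. This is not the converter's code: the function name `fold_per_expert_scale` is hypothetical, plain floats stand in for tensors, and the sketch assumes that "folding" means a plain multiplication of the two scales. Only the tensor-name suffixes and the pop() step come from the description.

```python
# Hypothetical sketch: names follow the PR description, values are made up.
import re

ROUTER_SCALE = re.compile(r"^(?P<layer>.+)\.router\.per_expert_scale$")


def fold_per_expert_scale(model_tensors: dict) -> None:
    """Fold router.per_expert_scale into each expert's down_proj.weight_scale_2."""
    router_names = [n for n in model_tensors if ROUTER_SCALE.match(n)]
    for name in router_names:
        layer = ROUTER_SCALE.match(name).group("layer")
        # pop() so that no later mapping sees the router scale as its own tensor
        per_expert = model_tensors.pop(name)
        for expert_idx, scale in enumerate(per_expert):
            key = f"{layer}.experts.{expert_idx}.down_proj.weight_scale_2"
            # assumption: folding is a multiplication of two scalar scales
            model_tensors[key] *= scale


tensors = {
    "layers.0.router.per_expert_scale": [2.0, 0.5],
    "layers.0.experts.0.down_proj.weight_scale_2": 0.25,
    "layers.0.experts.1.down_proj.weight_scale_2": 4.0,
}
fold_per_expert_scale(tensors)
print(tensors)
```

After the call only the two weight_scale_2 entries remain, so nothing is left to compete for the same output slot.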

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, Took AI assistance for debugging and refactoring.

Signed-off-by: ynankani <ynankani@nvidia.com>
ynankani requested a review from CISC as a code owner May 7, 2026 13:21
github-actions Bot added the model (Model specific) and python (python script changes) labels May 7, 2026
Comment thread src/models/gemma4.cpp Outdated
Comment thread src/models/gemma4.cpp Outdated
Comment thread convert_hf_to_gguf.py Outdated
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Comment thread convert_hf_to_gguf.py
@catlilface

This comment was marked as off-topic.

Signed-off-by: ynankani <ynankani@nvidia.com>
CISC commented May 8, 2026

Member

BTW, it doesn't look like we support fused gate/up experts in _generate_nvfp4_tensors, but since Gemma4 got split, maybe this is not a thing (at least with ModelOpt)?

CISC commented May 8, 2026

Member

@ynankani Looks like GitHub UI snuck in \r\n again, can you normalize to \n?

Signed-off-by: ynankani <ynankani@nvidia.com>
ynankani commented May 8, 2026

Contributor Author

> BTW, it doesn't look like we support fused gate/up experts in _generate_nvfp4_tensors, but since Gemma4 got split, maybe this is not a thing (at least with ModelOpt)?

I see ModelOpt treating gate and up experts as separate for quantization and export (Link), so it should be fine, since separate gate and up are exported.

I also see logic for calibration sync for gate and up proj (Link, Link2), so the serving engine can fuse gate and up at inference time.
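
A small sketch of why separately exported gate and up projections can still be fused later: stacking the two weight matrices and doing one matrix-vector product gives the same two outputs as doing two products. The helper names are hypothetical, the "gate rows first, then up rows" layout is an assumption, and the sketch uses unquantized values, so it says nothing about how NVFP4 scales are handled when fusing.

```python
# Hypothetical sketch with plain Python lists standing in for weight tensors.
def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def gate_up_separate(w_gate, w_up, x):
    """Two independent projections, as exported in the checkpoint."""
    return matvec(w_gate, x), matvec(w_up, x)


def gate_up_fused(w_gate, w_up, x):
    """One projection over the stacked matrix, then split the result."""
    fused = w_gate + w_up  # assumption: gate rows first, then up rows
    out = matvec(fused, x)
    n = len(w_gate)
    return out[:n], out[n:]


w_gate = [[1, 2], [3, 4]]
w_up = [[5, 6], [7, 8]]
x = [2, 1]
print(gate_up_separate(w_gate, w_up, x))
print(gate_up_fused(w_gate, w_up, x))
```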

ynankani commented May 8, 2026

Contributor Author

> @ynankani Looks like GitHub UI snuck in \r\n again, can you normalize to \n?

Is it possible in .gitattributes to set eol=lf for python file?

CISC commented May 8, 2026

Member

> > @ynankani Looks like GitHub UI snuck in \r\n again, can you normalize to \n?
>
> Is it possible in .gitattributes to set eol=lf for python file?

I'm not sure what the reason is, but it seems totally random, so I doubt there's anything we can do.
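
For reference, the \r\n to \n normalization requested above can be done with a few lines of Python. This is only a sketch of one way to do it; the helper names are hypothetical and this is not necessarily how the "fix CRLF" commit was produced.

```python
# Hypothetical helper for normalizing line endings; not part of the PR.
from pathlib import Path


def normalize_newlines(data: bytes) -> bytes:
    """Convert CRLF line endings to LF, leaving existing LF endings alone."""
    return data.replace(b"\r\n", b"\n")


def normalize_file(path: Path) -> bool:
    """Rewrite the file in place; return True if anything changed."""
    original = path.read_bytes()
    fixed = normalize_newlines(original)
    if fixed != original:
        path.write_bytes(fixed)
        return True
    return False


print(normalize_newlines(b"import os\r\nprint(os.name)\n"))
```

Working on bytes rather than text avoids any newline translation by Python's text mode.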

Signed-off-by: ynankani <ynankani@nvidia.com>
CISC merged commit 9f5f0e6 into ggml-org:master May 8, 2026
47 of 50 checks passed
cetarthoriphros pushed a commit to cetarthoriphros/llama.cpp that referenced this pull request May 9, 2026
* Gemma4_26B_A4B_NvFp4 hf checkpoint convert to gguf format fixes

Signed-off-by: ynankani <ynankani@nvidia.com>

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Address review comments

Signed-off-by: ynankani <ynankani@nvidia.com>

* fix CRLF

Signed-off-by: ynankani <ynankani@nvidia.com>

* Lint error fix

Signed-off-by: ynankani <ynankani@nvidia.com>

---------

Signed-off-by: ynankani <ynankani@nvidia.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
meh pushed a commit to meh/llama.cpp that referenced this pull request May 10, 2026
pwilkin added a commit to pwilkin/llama.cpp that referenced this pull request May 13, 2026
Ports 15 upstream commits (05e141a..5d44db6) that touched the
monolithic convert_hf_to_gguf.py into the new conversion/*.py layout
introduced by the refactor split.

New text/mmproj architectures registered:
  GraniteSpeechForConditionalGeneration, MiMoV2ForCausalLM,
  MiniCPMV4_6ForConditionalGeneration, Sarashina2VisionForCausalLM,
  SarvamMoEForCausalLM (+ modeling_sarvam_moe.SarvamMoEForCausalLM).

Notable changes:
- filter_tensors classmethod added to ModelBase/TextModel/MmprojModel
  and wired into index_tensors; many model classes refactored to move
  tensor-name skip/rename logic out of modify_tensors and into
  filter_tensors (upstream ggml-org#22597).
- LlamaModel._repack_nvfp4 override (Q/K RoPE permutation, ggml-org#22611).
- MistralModel yarn apply_scale support (ggml-org#22612).
- Gemma4Model._generate_nvfp4_tensors override for 26B NVFP4 (ggml-org#22804).
- LlavaVisionModel image-break token fallback for Mistral params.json
  -1 placeholders (ggml-org#22914).
- Pixtral 12B --mistral-format conversion fixes (ggml-org#22981).
- FP8 KV-cache scales fix (ggml-org#22818) and uint dtype byteswap disable
  (ggml-org#18908).

New files:
  conversion/sarashina2.py (Sarashina2VL text + vision)
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 19, 2026
jimbothigpen pushed a commit to jimbothigpen/llama.cpp that referenced this pull request May 21, 2026
The three files touched by mainline 9f5f0e6 are all already at the
target state in ygg HEAD:

  - convert_hf_to_gguf.py: ygg uses the per-arch refactor (Bundle C
    ggml-org#17114); Gemma4Model lives in conversion/gemma.py. mainline ggml-org#17114
    was merged AFTER ggml-org#22804 in upstream, so the refactor carried the
    NVFP4 _generate_nvfp4_tensors method and updated filter_tensors
    along with it. conversion/gemma.py lines 719-752 already contain
    the upstream additions verbatim.
  - gguf-py/gguf/constants.py: MODEL_ARCH.GEMMA4 block (line 2522+)
    already includes FFN_GATE_EXP and FFN_UP_EXP (auto-merged cleanly).
  - src/models/gemma4.cpp: ffn_gate_up_exps TENSOR_NOT_REQUIRED path
    and build_moe_ffn argument updates already in HEAD (auto-merged
    cleanly).

Empty commit retained for audit-trail/lineage; pairs with the previous
loader port (baddad949 = mainline 42928bc).
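
The "fused tensor optional, separate tensors as fallback" behavior referred to in the src/models/gemma4.cpp bullet can be sketched in Python as below. The real loader is C++ and its API differs; the function name `pick_expert_ffn_inputs`, the returned dictionary shape, and the `.weight` suffix on the GGUF tensor names are assumptions made for illustration.

```python
# Hypothetical sketch in Python; the actual loader lives in C++.
def pick_expert_ffn_inputs(tensors: dict, layer: int) -> dict:
    """Prefer the fused gate/up tensor; fall back to separate gate and up."""
    prefix = f"blk.{layer}"
    fused = tensors.get(f"{prefix}.ffn_gate_up_exps.weight")  # optional tensor
    if fused is not None:
        return {"gate_up": fused, "gate": None, "up": None}
    # Fused tensor absent: the separate tensors are required instead.
    return {
        "gate_up": None,
        "gate": tensors[f"{prefix}.ffn_gate_exps.weight"],
        "up": tensors[f"{prefix}.ffn_up_exps.weight"],
    }


# Strings stand in for tensor objects.
split_model = {
    "blk.0.ffn_gate_exps.weight": "GATE",
    "blk.0.ffn_up_exps.weight": "UP",
}
fused_model = {"blk.0.ffn_gate_up_exps.weight": "GATE_UP"}
print(pick_expert_ffn_inputs(split_model, 0))
print(pick_expert_ffn_inputs(fused_model, 0))
```
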
baramofme pushed a commit to baramofme/llama-cpp-turboquant that referenced this pull request May 23, 2026
jimbothigpen pushed a commit to jimbothigpen/llama.cpp that referenced this pull request May 25, 2026
jimbothigpen pushed a commit to jimbothigpen/llama.cpp that referenced this pull request May 25, 2026
winstonma pushed a commit to winstonma/llama.cpp that referenced this pull request May 27, 2026
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026
Labels

model (Model specific), python (python script changes)

4 participants