Gemma4_26B_A4B_NvFp4 hf checkpoint convert to gguf format fixes by ynankani · Pull Request #22804 · ggml-org/llama.cpp

ynankani · 2026-05-07T13:21:33Z

Overview

Fixes for converting the Gemma4_26B_A4B_NvFp4 HF checkpoint to GGUF format. This PR makes the following changes:

Excluded weight_scale, weight_scale_2, and input_scale from the existing + ".weight" rename for .experts. tensors. The original rename mangled NVFP4 scale tensor names (e.g. experts.0.down_proj.weight_scale_2 => experts.0.down_proj.weight_scale_2.weight), breaking the NVFP4 lookup in _generate_nvfp4_tensors.
Added FFN_GATE_EXP and FFN_UP_EXP alongside the existing FFN_GATE_UP_EXP in the GEMMA4 tensor allow-list. Originally only the fused FFN_GATE_UP_EXP was allowed. HF NVFP4 checkpoints store gate/up/down as separate per-expert tensors, so the converter couldn't map them. The other option, fusing the gate and up projections, would have required re-quantizing.
Made ffn_gate_up_exps optional (TENSOR_NOT_REQUIRED) and added fallback creation of separate ffn_gate_exps and ffn_up_exps when the fused tensor is absent.
Added conditional plumbing in build_moe_ffn so that it passes either the fused tensor or the separate tensors.
Pre-folds each layer's router.per_expert_scale into the corresponding expert's down_proj.weight_scale_2 at conversion time, then pop()s router.per_expert_scale from model_tensors so the existing modify_tensors mapping doesn't fire for NVFP4 conversions. That mapping was causing a "Duplicated tensor name 'blk.0.ffn_down_exps.scale'" error, because per_expert_scale and weight_scale_2 both mapped to the same slot.
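
For illustration only, the rename exclusion and the scale pre-fold described above can be sketched roughly as follows. This is not the converter's actual code: the helper names, the key layout under `layer_prefix`, the use of plain floats in place of tensors, and the assumption that "folding" means multiplying weight_scale_2 by the matching per_expert_scale entry are all hypothetical.

```python
# Hypothetical sketch; names and key layout are assumptions, not the converter's code.
NVFP4_SCALE_SUFFIXES = (".weight_scale", ".weight_scale_2", ".input_scale")


def rename_expert_tensor(name: str) -> str:
    """Append ".weight" to bare .experts. tensor names, skipping NVFP4 scale tensors."""
    if ".experts." not in name:
        return name
    # Scale tensors (and names that already end in ".weight") must keep their name,
    # otherwise a later lookup by "<proj>.weight_scale_2" would miss them.
    if name.endswith(NVFP4_SCALE_SUFFIXES) or name.endswith(".weight"):
        return name
    return name + ".weight"


def fold_per_expert_scale(tensors: dict, layer_prefix: str) -> None:
    """Fold router.per_expert_scale into each expert's down_proj.weight_scale_2.

    Assumes the fold is a plain multiplication and that scales are floats.
    The router entry is popped so nothing else maps it to the same output slot.
    """
    per_expert = tensors.pop(f"{layer_prefix}.router.per_expert_scale", None)
    if per_expert is None:
        return
    for expert_id, scale in enumerate(per_expert):
        key = f"{layer_prefix}.experts.{expert_id}.down_proj.weight_scale_2"
        tensors[key] = tensors[key] * scale
```

Popping the router entry after folding leaves a single source for the per-expert down-projection scale slot, which is the property the duplicate-name error was violating.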

Additional information

Tested with https://huggingface.co/nvidia/Gemma-4-26B-A4B-NVFP4

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: Yes, took AI assistance for debugging and refactoring.

Signed-off-by: ynankani <ynankani@nvidia.com>

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

CISC · 2026-05-08T09:50:07Z

BTW, it doesn't look like we support fused gate/up experts in _generate_nvfp4_tensors, but since Gemma4 got split, maybe this is not a thing (at least with ModelOpt)?

CISC · 2026-05-08T09:53:44Z

@ynankani Looks like GitHub UI snuck in \r\n again, can you normalize to \n?

ynankani · 2026-05-08T11:23:29Z

BTW, it doesn't look like we support fused gate/up experts in _generate_nvfp4_tensors, but since Gemma4 got split, maybe this is not a thing (at least with ModelOpt)?

I see ModelOpt treating the gate and up experts as separate for quantization and export (Link).
So it should be fine, since separate gate and up tensors are exported.

Also, I see logic for calibration sync for the gate and up proj (Link, Link2).
So the serving engine can fuse gate and up at inference time.
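
As a rough illustration of why fusing at inference time is possible, the sketch below stacks the gate and up weight rows into one matrix and splits the result, using plain Python floats. The function names are hypothetical, and it says nothing about how any particular serving engine does this. For quantized weights the two tensors would presumably also need compatible scales, which appears to be the purpose of the calibration sync mentioned above.

```python
def matvec(w, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def fused_gate_up(w_gate, w_up, x):
    """Run gate and up as one matvec over the stacked rows, then split the output."""
    fused = w_gate + w_up          # rows of gate followed by rows of up
    out = matvec(fused, x)
    n_gate = len(w_gate)
    return out[:n_gate], out[n_gate:]
```

The fused call returns the same two vectors as running `matvec` on each weight matrix separately, so exporting the tensors unfused loses nothing.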

ynankani · 2026-05-08T11:29:41Z

@ynankani Looks like GitHub UI snuck in \r\n again, can you normalize to \n?

Is it possible in .gitattributes to set eol=lf for python file?

CISC · 2026-05-08T11:32:10Z

@ynankani Looks like GitHub UI snuck in \r\n again, can you normalize to \n?

Is it possible in .gitattributes to set eol=lf for python file?

I'm not sure what the reason is, but it seems totally random, so I doubt there's anything we can do.

* Gemma4_26B_A4B_NvFp4 hf checkpoint convert to gguf format fixes
  Signed-off-by: ynankani <ynankani@nvidia.com>
* Apply suggestions from code review
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Address review comments
  Signed-off-by: ynankani <ynankani@nvidia.com>
* fix CRLF
  Signed-off-by: ynankani <ynankani@nvidia.com>
* Lint error fix
  Signed-off-by: ynankani <ynankani@nvidia.com>
---------
Signed-off-by: ynankani <ynankani@nvidia.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

Ports 15 upstream commits (05e141a..5d44db6) that touched the monolithic convert_hf_to_gguf.py into the new conversion/*.py layout introduced by the refactor split.

New text/mmproj architectures registered: GraniteSpeechForConditionalGeneration, MiMoV2ForCausalLM, MiniCPMV4_6ForConditionalGeneration, Sarashina2VisionForCausalLM, SarvamMoEForCausalLM (+ modeling_sarvam_moe.SarvamMoEForCausalLM).

Notable changes:
- filter_tensors classmethod added to ModelBase/TextModel/MmprojModel and wired into index_tensors; many model classes refactored to move tensor-name skip/rename logic out of modify_tensors and into filter_tensors (upstream ggml-org#22597).
- LlamaModel._repack_nvfp4 override (Q/K RoPE permutation, ggml-org#22611).
- MistralModel yarn apply_scale support (ggml-org#22612).
- Gemma4Model._generate_nvfp4_tensors override for 26B NVFP4 (ggml-org#22804).
- LlavaVisionModel image-break token fallback for Mistral params.json -1 placeholders (ggml-org#22914).
- Pixtral 12B --mistral-format conversion fixes (ggml-org#22981).
- FP8 KV-cache scales fix (ggml-org#22818) and uint dtype byteswap disable (ggml-org#18908).

New files: conversion/sarashina2.py (Sarashina2VL text + vision)

The three files touched by mainline 9f5f0e6 are all already at the target state in ygg HEAD:
- convert_hf_to_gguf.py: ygg uses the per-arch refactor (Bundle C ggml-org#17114); Gemma4Model lives in conversion/gemma.py. mainline ggml-org#17114 was merged AFTER ggml-org#22804 in upstream, so the refactor carried the NVFP4 _generate_nvfp4_tensors method and updated filter_tensors along with it. conversion/gemma.py lines 719-752 already contain the upstream additions verbatim.
- gguf-py/gguf/constants.py: MODEL_ARCH.GEMMA4 block (line 2522+) already includes FFN_GATE_EXP and FFN_UP_EXP (auto-merged cleanly).
- src/models/gemma4.cpp: ffn_gate_up_exps TENSOR_NOT_REQUIRED path and build_moe_ffn argument updates already in HEAD (auto-merged cleanly).

Empty commit retained for audit-trail/lineage; pairs with the previous loader port (baddad949 = mainline 42928bc).

Gemma4_26B_A4B_NvFp4 hf checkpoint convert to gguf format fixes

328131c

Signed-off-by: ynankani <ynankani@nvidia.com>

ynankani requested a review from CISC as a code owner May 7, 2026 13:21

github-actions bot added the model (Model specific) and python (python script changes) labels May 7, 2026

CISC approved these changes May 7, 2026

Comment thread src/models/gemma4.cpp Outdated

Comment thread src/models/gemma4.cpp Outdated

Comment thread convert_hf_to_gguf.py Outdated

Apply suggestions from code review

9678f88

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

gaugarg-nv reviewed May 8, 2026

Comment thread convert_hf_to_gguf.py

This comment was marked as off-topic.

Address review comments

f34ae25

Signed-off-by: ynankani <ynankani@nvidia.com>

gaugarg-nv approved these changes May 8, 2026

Merge branch 'ggml-org:master' into ynankani/gemma4_moe_nvfp4_fixes

8f4159d

fix CRLF

3632560

Signed-off-by: ynankani <ynankani@nvidia.com>

Lint error fix

7747cc3

Signed-off-by: ynankani <ynankani@nvidia.com>

CISC approved these changes May 8, 2026

CISC merged commit 9f5f0e6 into ggml-org:master May 8, 2026
47 of 50 checks passed

Huckies mentioned this pull request May 9, 2026

NVDIA nvfp4 support for ollama ollama/ollama#16056

Open

CISC merged 6 commits into ggml-org:master from ynankani:ynankani/gemma4_moe_nvfp4_fixes
