
Fix: quantize and target moe layers in transformers v5 for adapters and many misc fixes#3439

Merged

winglian merged 43 commits into main from fix/v5-moe-qlora on Mar 3, 2026
Conversation

@NanoCode012 (Collaborator) commented Feb 26, 2026

Description

Closes #3374 #3370

Problem: Experts aren't properly targeted with the normal lora_target_modules; we need to use lora_target_parameters instead. Additionally, expert layers aren't being quantized by bnb.

Based on the work in #3395 by ved (without the LoRA kernel changes), with additional fixes:

  • quantizes experts on-load (reduces reserved memory)
    • qlora 4bit support
    • lora 8bit support
    • ensure target_parameter works (names are expanded otherwise)
  • fsdp2 support
    • qlora 4bit
    • lora 8bit
  • experts properly targeted with quantized experts
    • qlora 4bit
    • lora 8bit

How to use

(see included configs)

quantize_moe_experts: true

#lora_target_parameters:
#    - mlp.experts.gate_up_proj
#    - mlp.experts.down_proj
#    - mlp.gate.weight  # router
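
For context, a fuller illustrative QLoRA snippet combining the relevant fields (the base_model value and LoRA hyperparameters are placeholders; see the included example YAMLs for tested values):

base_model: zai-org/GLM-4.7-Flash   # placeholder; use your actual model
adapter: qlora
load_in_4bit: true
quantize_moe_experts: true

lora_r: 16          # placeholder hyperparameters
lora_alpha: 32
lora_dropout: 0.0   # must be 0 when lora_target_parameters is set (validator added in this PR)

lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
lora_target_parameters:
  - mlp.experts.gate_up_proj
  - mlp.experts.down_proj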

Results

For a QLoRA training run that previously used 127GiB peak memory, we managed to reduce peak usage to 23GiB.

[screenshot: training loss comparison]

The loss line differs because we swapped the optimizer (adamw_bnb_8bit -> adamw_torch_8bit) and nodes while working on this. We verified that, without our fix, the prior line is consistent.

We also incorporated these changes into LoRA and FSDP2 LoRA/QLoRA trainings.

Limitations

  • cpu_ram_efficient_loading with FSDP2 doesn't work with QLoRA (LoRA not tested): AttributeError: e_score_correction_bias is not an nn.Parameter, due to how it's instantiated
  • the total parameter count for the model is reported incorrectly; the trainable parameter count is correct (due to how we handle expert quantization). This is only a cosmetic bug; the correct number of parameters is still targeted
  • FSDP LoRA has a huge initial VRAM spike in the first 1-2 steps that then drops (reason unclear at the moment); FSDP QLoRA was fine
  • DeepSpeed not tested
  • bf16 LoRA not tested
  • the model may take longer to load due to on-demand quantization, even for consecutive runs; caching may be worth considering in future work

Misc:

  • re-adds saving changes that were accidentally removed in ScatterMoE LoRA support (#3410)
  • CCE updates to include fixes for qwen3_5 and qwen3_5_moe
  • on save, the state dict was cloned onto the GPU, which doubled VRAM usage; this PR ensures it is moved onto the CPU instead, and only for CP (a small sketch of the idea follows below)
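
A minimal sketch of that save-path idea, with placeholder names (not the actual trainer code):

def maybe_offload_state_dict(state_dict, context_parallel_enabled: bool):
    # Move gathered tensors to CPU before saving, but only when context
    # parallelism (CP) is active; otherwise leave the state dict untouched.
    if not context_parallel_enabled:
        return state_dict
    return {name: tensor.detach().to("cpu") for name, tensor in state_dict.items()}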

How to use

See included example yamls and README

Motivation and Context

How has this been tested?

Training loss with our on-load quantization is consistent with the previous post-load quantization approach. Tested on a single GPU for LoRA and QLoRA respectively. Also ensured FSDP/DDP losses were within reasonable expectations.

TODO (ongoing):

  • merge adapters into base (should work as before, since we don't apply any quantization on load there)
    • qlora
    • lora
  • inference on adapters (loss / ppl / grad norm looks fine)
    • qlora
    • lora

AI Usage Disclaimer

Claude was used heavily, with results checked via manual runs.


Thanks to ved for the original PR to base this off of and for helping test throughout.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added GLM-4.7-Flash model support with multiple fine-tuning configurations (LoRA, QLoRA, and distributed training variants).
    • Enabled MoE expert quantization to reduce VRAM usage during training.
    • Added FSDP2-based distributed training support for large models.
  • Bug Fixes

    • Fixed checkpoint saving for context-parallel setups.
    • Fixed DTensor compatibility issues with distributed training.
  • Documentation

    • Added comprehensive guides for GLM-4.7-Flash fine-tuning with various configurations.
    • Updated installation instructions and setup steps.
    • Added limitation notes for new model architectures.

@coderabbitai (Contributor, bot) commented Feb 26, 2026

📝 Walkthrough

This PR introduces MoE expert quantization support, adds GLM-4.7-Flash model fine-tuning configurations, extends FSDP2 with DTensor handling for QLoRA, updates the Cut Cross Entropy integration, adds MOE architecture mappings, and refines checkpoint saving for context-parallel setups.

Changes

Cohort / File(s) Summary
Cut Cross Entropy Integration
examples/colab-notebooks/colab-axolotl-example.ipynb, scripts/cutcrossentropy_install.py, src/axolotl/integrations/cut_cross_entropy/README.md, src/axolotl/integrations/cut_cross_entropy/__init__.py
Updated ml-cross-entropy git commit hash from 58d6572 to a668583 across installation references and help messages; updated plugin list in integration README.
GLM-4.7-Flash Examples
examples/glm4.7-flash/README.md, examples/glm4.7-flash/lora.yaml, examples/glm4.7-flash/lora_fsdp.yaml, examples/glm4.7-flash/qlora.yaml, examples/glm4.7-flash/qlora_fsdp.yaml
Added comprehensive documentation and configuration files for fine-tuning GLM-4.7-Flash with LoRA/QLoRA and FSDP variants; configurations include 8-bit/4-bit loading, CutCrossEntropyPlugin, LoRA settings, and FSDP2 parameters.
Trinity Examples
examples/trinity/README.md, examples/trinity/trinity-nano-preview-qlora.yaml
Reordered installation steps to include Cut Cross Entropy setup; updated VRAM usage note; removed unsupported limitations section; removed trust_remote_code from YAML config.
MOE Architecture Support
src/axolotl/common/architectures.py
Added three new MOE architecture mappings to MOE_ARCH_BLOCK: glm4_moe, glm4_moe_lite, and glm_moe_dsa.
Model Checkpoint Saving
src/axolotl/core/trainers/base.py
Modified checkpoint saving to conditionally convert state_dict to CPU for context-parallel setups; added is_main_process flag propagation to save_pretrained calls.
MoE Expert Quantization
src/axolotl/monkeypatch/moe_quant.py
Introduced new module for on-the-fly quantization of MoE expert weights during model loading; includes Bnb8bitParametrization class, quantization patches, and utilities for 4-bit/8-bit quantization with configurable compression (see the bitsandbytes sketch after this table).
FSDP2 and QLoRA Extensions
src/axolotl/monkeypatch/accelerate/fsdp2.py, src/axolotl/monkeypatch/fsdp2_qlora.py
Added DTensor support for FSDP2 state_dict retrieval and PEFT parameter wrappers; extended FSDP2 patching for Int8Params and Params4bit variants; added patches for Linear8bitLt save and FSDPParam dtype attribute initialization.
Model Loader Integration
src/axolotl/loaders/model.py
Added post-model-build patch application step; introduced guard to skip k-bit preparation when MOE experts are quantized.
Patch Manager
src/axolotl/loaders/patch_manager.py
Added MoE expert quantization patch methods; wired in pre-load, post-load, and post-build patch hooks; extended FSDP2 + QLoRA handling with additional patches for dtype attributes and Linear8bitLt save.
Configuration Schemas
src/axolotl/utils/schemas/config.py, src/axolotl/utils/schemas/peft.py
Added quantize_moe_experts boolean field with validators requiring LoRA/QLoRA adapter and 4-bit/8-bit loading; added validator in PeftConfig to enforce zero dropout when lora_target_parameters is set.
Documentation
src/axolotl/integrations/kernels/README.md
Added limitation note that ScatterMoE does not work with GLM4.7 Flash (glm4_moe_lite).
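
To ground the MoE Expert Quantization row above, here is a hedged standalone sketch of the bitsandbytes blockwise NF4 quantize/dequantize primitives that on-load expert quantization builds on. It is not the PR's moe_quant.py code, and the tensor shape is a toy value:

import torch
import bitsandbytes.functional as bnbf

# Toy stand-in for a single expert weight matrix (real MoE expert tensors are larger).
weight = torch.randn(4096, 1024, dtype=torch.bfloat16, device="cuda")

# Blockwise 4-bit NF4 quantization with double quantization of the statistics,
# mirroring the bnb_4bit_quant_type / bnb_4bit_use_double_quant config options.
packed, quant_state = bnbf.quantize_4bit(
    weight, quant_type="nf4", compress_statistics=True
)
print(f"packed storage: {packed.numel() * packed.element_size() / 2**20:.2f} MiB")

# Dequantize back to a dense tensor when the weight is needed for a forward pass.
restored = bnbf.dequantize_4bit(packed, quant_state)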

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

scheduled_release

Suggested reviewers

  • djsaunde
  • winglian
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 52.63%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Title check: ✅ Passed. The title accurately captures the main change: fixing MoE layer quantization and targeting in transformers v5 for adapters, along with miscellaneous related fixes mentioned throughout the changeset.
  • Description check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.


@NanoCode012 changed the title from "Fix: v5 moe with qlora / lora" to "Fix: v5 moe with qlora / lora and many misc fixes" on Feb 26, 2026
@NanoCode012 linked an issue on Feb 27, 2026 that may be closed by this pull request
@NanoCode012 requested a review from winglian on February 27, 2026 09:37
@coderabbitai (Contributor, bot) left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/axolotl/core/trainers/base.py (1)

753-757: ⚠️ Potential issue | 🔴 Critical

Unconditional is_main_process kwarg will break custom model saves.

Lines 753-757 and 768-772 unconditionally pass is_main_process to save_pretrained(). Custom model overrides that don't accept this parameter (e.g., src/axolotl/models/mamba/modeling_mamba.py:110-114) will raise TypeError during checkpoint save. The mamba model's save_pretrained signature accepts only save_directory and state_dict, with no **kwargs to absorb additional parameters.

This requires conditional passing of the kwarg only when the method signature supports it, such as using inspect.signature() to check before calling.
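
A hedged sketch of that guard (wrapper name is illustrative, not the actual trainer code):

import inspect

def save_with_optional_main_process_flag(model, output_dir, state_dict, is_main_process):
    # Forward is_main_process only if the model's save_pretrained accepts it,
    # either as an explicit parameter or via **kwargs.
    params = inspect.signature(model.save_pretrained).parameters
    accepts_flag = "is_main_process" in params or any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    kwargs = {"state_dict": state_dict}
    if accepts_flag:
        kwargs["is_main_process"] = is_main_process
    model.save_pretrained(output_dir, **kwargs)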

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/core/trainers/base.py` around lines 753 - 757, The
save_pretrained call is passing is_main_process unconditionally which breaks
custom implementations (e.g., modeling_mamba's save_pretrained) that don't
accept that kwarg; update the code that calls save_pretrained (the call where
state_dict and is_main_process are passed) to first inspect the save_pretrained
signature (use inspect.signature on the model's save_pretrained method) and only
include is_main_process when the parameter is accepted (otherwise call with only
save_directory and state_dict); ensure you reference the same save_pretrained
method used in the trainer and use self.accelerator.is_main_process for the
flag.
🧹 Nitpick comments (2)
examples/glm4.7-flash/qlora.yaml (1)

26-34: Clarify whether this example adapts experts or only attention projections.

Right now the active config targets q_proj/v_proj/k_proj/o_proj, while expert parameter targets are commented out. A brief inline note would prevent confusion for users expecting MoE expert adaptation in this example.

📝 Suggested clarification
 lora_target_modules:
   - q_proj
   - v_proj
   - k_proj
   - o_proj
+# This example fine-tunes attention projections only.
+# To also adapt MoE expert tensors, use lora_target_parameters:
 # lora_target_parameters:
 #   - mlp.experts.gate_up_proj
 #   - mlp.experts.down_proj
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/glm4.7-flash/qlora.yaml` around lines 26 - 34, The config currently
only adapts attention projection modules (lora_target_modules: q_proj, v_proj,
k_proj, o_proj) while the MoE expert parameter targets (lora_target_parameters:
mlp.experts.gate_up_proj, mlp.experts.down_proj) are commented out; update the
YAML to make intent explicit by either uncommenting the lora_target_parameters
entries (and enabling lora_mlp_kernel if needed) when you intend to adapt
experts, or add a one-line inline comment above lora_target_modules explaining
this example only adapts attention projections and does not modify MoE expert
parameters.
src/axolotl/loaders/patch_manager.py (1)

380-387: Limit PEFT monkeypatch to adapter flows and apply once.

Line 386 applies a global PEFT method patch whenever quantize_moe_experts is true, even if no adapter path is active. Consider gating this on adapter usage and adding a one-time guard to reduce global side effects.

♻️ Proposed fix
     patch_moe_quantization_on_load(self.cfg)
-    patch_peft_target_parameters_matching()
+    if self.cfg.adapter:
+        patch_peft_target_parameters_matching()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/loaders/patch_manager.py` around lines 380 - 387, Only apply the
PEFT monkeypatch when adapters are actually in use and ensure it's applied once:
check the config flag that indicates an adapter path is active (e.g.,
self.cfg.adapter_path or equivalent adapter-enabled field) before calling
patch_peft_target_parameters_matching(), and guard the call with a one-time flag
(module-level or attribute like self._peft_patch_applied) so subsequent loads
don't reapply the global patch; keep patch_moe_quantization_on_load(self.cfg)
behavior unchanged but move or wrap patch_peft_target_parameters_matching()
behind the adapter check and one-time guard.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/glm4.7-flash/README.md`:
- Line 63: The README's full-finetune tip is incomplete: when instructing users
to remove adapter: qlora and load_in_4bit: true from the FSDP2 config, also
disable or remove quantize_moe_experts because it currently requires adapter to
be lora or qlora and will fail validation; update the sentence to say to set
quantize_moe_experts: false (or remove that key) in the FSDP2 config when doing
a full finetune so validation passes.

In `@scripts/cutcrossentropy_install.py`:
- Line 32: The pip-install line in scripts/cutcrossentropy_install.py currently
pins the repository with a short commit hash ("a668583") in the string
f'{UV_PREFIX}pip install "cut-cross-entropy[transformers] @
git+https://github.com/axolotl-ai-cloud/ml-cross-entropy.git@a668583"'; replace
that abbreviated SHA with the repository's full 40-character commit SHA to
ensure deterministic dependency resolution (look up the full SHA for the
intended commit in the ml-cross-entropy repo and update the string accordingly).

In `@src/axolotl/monkeypatch/fsdp2_qlora.py`:
- Around line 166-193: The runtime monkeypatch re-wraps methods on every call;
modify apply_linear8bitlt_save_patch to be idempotent by detecting and skipping
if already applied: capture and store the original function once (e.g., a
module-level variable or by checking an attribute on
bnb.nn.Linear8bitLt._save_to_state_dict), return immediately if the method is
already the patched wrapper, and when installing the wrapper, mark the patched
function with a sentinel attribute (e.g., _is_axolotl_patched = True) so
repeated calls do nothing; apply the same pattern to
apply_init_dtype_attrs_patch (use unique sentinel names and preserved originals
like original_save / original_init_dtype_attrs) to avoid stacking wrappers.

In `@src/axolotl/monkeypatch/moe_quant.py`:
- Around line 82-107: Compute and write the current runtime quantization state
into _moe_load_state before the early "already patched" return: determine mode
from cfg (same logic using getattr(cfg, "load_in_8bit", False) to select "8bit"
or "4bit"), set _moe_load_state["mode"] = mode and _moe_load_state["count"] = 0
(and if applicable pre-populate
_moe_load_state["quant_type"]/_moe_load_state["compress_statistics"] when mode
== "4bit") before the if _moe_load_state["patched"] check so later loads don't
see stale state from prior loads.

In `@src/axolotl/utils/schemas/config.py`:
- Around line 632-641: Extend the existing Pydantic validation that currently
checks adapter + load_in_4bit/load_in_8bit to also enforce that
quantize_moe_experts can only be true when the runtime backend is CUDA (reject
when backend is ROCm/other); specifically, in the same validator that references
quantize_moe_experts and the load_in_4bit/load_in_8bit flags, add a check that
the configured backend/device backend (or runtime torch backend detection)
indicates CUDA and raise a ValidationError if quantize_moe_experts is true but
the backend is not CUDA so the config fails validation early.

---

Outside diff comments:
In `@src/axolotl/core/trainers/base.py`:
- Around line 753-757: The save_pretrained call is passing is_main_process
unconditionally which breaks custom implementations (e.g., modeling_mamba's
save_pretrained) that don't accept that kwarg; update the code that calls
save_pretrained (the call where state_dict and is_main_process are passed) to
first inspect the save_pretrained signature (use inspect.signature on the
model's save_pretrained method) and only include is_main_process when the
parameter is accepted (otherwise call with only save_directory and state_dict);
ensure you reference the same save_pretrained method used in the trainer and use
self.accelerator.is_main_process for the flag.

---

Nitpick comments:
In `@examples/glm4.7-flash/qlora.yaml`:
- Around line 26-34: The config currently only adapts attention projection
modules (lora_target_modules: q_proj, v_proj, k_proj, o_proj) while the MoE
expert parameter targets (lora_target_parameters: mlp.experts.gate_up_proj,
mlp.experts.down_proj) are commented out; update the YAML to make intent
explicit by either uncommenting the lora_target_parameters entries (and enabling
lora_mlp_kernel if needed) when you intend to adapt experts, or add a one-line
inline comment above lora_target_modules explaining this example only adapts
attention projections and does not modify MoE expert parameters.

In `@src/axolotl/loaders/patch_manager.py`:
- Around line 380-387: Only apply the PEFT monkeypatch when adapters are
actually in use and ensure it's applied once: check the config flag that
indicates an adapter path is active (e.g., self.cfg.adapter_path or equivalent
adapter-enabled field) before calling patch_peft_target_parameters_matching(),
and guard the call with a one-time flag (module-level or attribute like
self._peft_patch_applied) so subsequent loads don't reapply the global patch;
keep patch_moe_quantization_on_load(self.cfg) behavior unchanged but move or
wrap patch_peft_target_parameters_matching() behind the adapter check and
one-time guard.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 18f26c1 and d6475de.

📒 Files selected for processing (21)
  • examples/colab-notebooks/colab-axolotl-example.ipynb
  • examples/glm4.7-flash/README.md
  • examples/glm4.7-flash/lora.yaml
  • examples/glm4.7-flash/lora_fsdp.yaml
  • examples/glm4.7-flash/qlora.yaml
  • examples/glm4.7-flash/qlora_fsdp.yaml
  • examples/trinity/README.md
  • examples/trinity/trinity-nano-preview-qlora.yaml
  • scripts/cutcrossentropy_install.py
  • src/axolotl/common/architectures.py
  • src/axolotl/core/trainers/base.py
  • src/axolotl/integrations/cut_cross_entropy/README.md
  • src/axolotl/integrations/cut_cross_entropy/__init__.py
  • src/axolotl/integrations/kernels/README.md
  • src/axolotl/loaders/model.py
  • src/axolotl/loaders/patch_manager.py
  • src/axolotl/monkeypatch/accelerate/fsdp2.py
  • src/axolotl/monkeypatch/fsdp2_qlora.py
  • src/axolotl/monkeypatch/moe_quant.py
  • src/axolotl/utils/schemas/config.py
  • src/axolotl/utils/schemas/peft.py
💤 Files with no reviewable changes (1)
  • examples/trinity/trinity-nano-preview-qlora.yaml

Comment thread examples/glm4.7-flash/README.md Outdated
- `temperature: 1.0`
- `top_p: 0.95`
- `max_new_tokens: 131072`
- You can run a full finetuning by removing the `adapter: qlora` and `load_in_4bit: true` from the FSDP2 config. This is heavy, so we have not tested this.
Contributor

⚠️ Potential issue | 🟡 Minor

Update the full-finetune tip to disable quantize_moe_experts.

Line 63 currently tells users to remove adapter: qlora and load_in_4bit: true, but the config still contains quantize_moe_experts: true, which now requires adapter to be lora or qlora and will fail validation.

📝 Suggested doc fix
-- You can run a full finetuning by removing the `adapter: qlora` and `load_in_4bit: true` from the FSDP2 config. This is heavy, so we have not tested this.
+- You can run a full finetuning by removing `adapter: qlora`, removing `load_in_4bit: true`, and setting `quantize_moe_experts: false` in the FSDP2 config. This is heavy, so we have not tested this.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- You can run a full finetuning by removing the `adapter: qlora` and `load_in_4bit: true` from the FSDP2 config. This is heavy, so we have not tested this.
- You can run a full finetuning by removing `adapter: qlora`, removing `load_in_4bit: true`, and setting `quantize_moe_experts: false` in the FSDP2 config. This is heavy, so we have not tested this.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/glm4.7-flash/README.md` at line 63, The README's full-finetune tip
is incomplete: when instructing users to remove adapter: qlora and load_in_4bit:
true from the FSDP2 config, also disable or remove quantize_moe_experts because
it currently requires adapter to be lora or qlora and will fail validation;
update the sentence to say to set quantize_moe_experts: false (or remove that
key) in the FSDP2 config when doing a full finetune so validation passes.

Comment thread scripts/cutcrossentropy_install.py
Comment on lines +166 to +193
def apply_linear8bitlt_save_patch():
    """Patch Linear8bitLt._save_to_state_dict to handle DTensor-wrapped Int8Params.

    After FSDP2 sharding, Linear8bitLt.weight is a DTensor wrapping Int8Params.
    BnB's _save_to_state_dict accesses self.weight.SCB directly, but DTensor
    doesn't proxy custom attribute access to its _local_tensor. This patch
    temporarily unwraps the DTensor during saving so BnB can find the SCB attribute.
    """
    import bitsandbytes as bnb
    from torch.distributed.tensor import DTensor

    original_save = bnb.nn.Linear8bitLt._save_to_state_dict

    def _patched_save_to_state_dict(self, destination, prefix, keep_vars):
        # Use _parameters dict directly to bypass nn.Module.__setattr__ type check.
        weight = self._parameters["weight"]
        unwrapped = False
        if isinstance(weight, DTensor) and hasattr(weight, "_local_tensor"):
            self._parameters["weight"] = weight._local_tensor
            unwrapped = True
        try:
            original_save(self, destination, prefix, keep_vars)
        finally:
            if unwrapped:
                self._parameters["weight"] = weight

    bnb.nn.Linear8bitLt._save_to_state_dict = _patched_save_to_state_dict
    LOG.info("Patched Linear8bitLt._save_to_state_dict for DTensor compatibility")
Contributor

⚠️ Potential issue | 🟠 Major

Add idempotency guards for runtime monkeypatches.

apply_linear8bitlt_save_patch() and apply_init_dtype_attrs_patch() re-wrap methods on every call. In repeated model-load flows this stacks wrappers and can cause hard-to-debug behavior/perf regressions.

♻️ Proposed fix
 def apply_linear8bitlt_save_patch():
@@
     import bitsandbytes as bnb
     from torch.distributed.tensor import DTensor
 
+    if getattr(bnb.nn.Linear8bitLt, "_axolotl_save_patch_applied", False):
+        return
+
     original_save = bnb.nn.Linear8bitLt._save_to_state_dict
@@
     bnb.nn.Linear8bitLt._save_to_state_dict = _patched_save_to_state_dict
+    bnb.nn.Linear8bitLt._axolotl_save_patch_applied = True
     LOG.info("Patched Linear8bitLt._save_to_state_dict for DTensor compatibility")
@@
 def apply_init_dtype_attrs_patch():
@@
     from torch.distributed.fsdp._fully_shard._fsdp_param import FSDPParam
 
+    if getattr(FSDPParam, "_axolotl_init_dtype_attrs_patch_applied", False):
+        return
+
     original_init_dtype_attrs = FSDPParam.init_dtype_attrs
@@
     FSDPParam.init_dtype_attrs = patched_init_dtype_attrs
+    FSDPParam._axolotl_init_dtype_attrs_patch_applied = True
     LOG.info("Patched FSDPParam.init_dtype_attrs for non-float quantized params")

Also applies to: 196-224

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/monkeypatch/fsdp2_qlora.py` around lines 166 - 193, The runtime
monkeypatch re-wraps methods on every call; modify apply_linear8bitlt_save_patch
to be idempotent by detecting and skipping if already applied: capture and store
the original function once (e.g., a module-level variable or by checking an
attribute on bnb.nn.Linear8bitLt._save_to_state_dict), return immediately if the
method is already the patched wrapper, and when installing the wrapper, mark the
patched function with a sentinel attribute (e.g., _is_axolotl_patched = True) so
repeated calls do nothing; apply the same pattern to
apply_init_dtype_attrs_patch (use unique sentinel names and preserved originals
like original_save / original_init_dtype_attrs) to avoid stacking wrappers.

Comment on lines +82 to +107
    if _moe_load_state["patched"]:
        LOG.debug("MoE loading-time quantization patch already active")
        return

    import transformers.core_model_loading
    import transformers.modeling_utils

    if getattr(cfg, "load_in_8bit", False):
        mode = "8bit"
    else:
        mode = "4bit"

    _moe_load_state["mode"] = mode
    _moe_load_state["count"] = 0

    if mode == "4bit":
        from bitsandbytes.nn.parametrize import replace_parameter_4bit

        quant_type = getattr(cfg, "bnb_4bit_quant_type", None) or "nf4"
        compress_statistics = getattr(cfg, "bnb_4bit_use_double_quant", None)
        if compress_statistics is None:
            compress_statistics = True

        _moe_load_state["quant_type"] = quant_type
        _moe_load_state["compress_statistics"] = compress_statistics


Contributor

⚠️ Potential issue | 🟠 Major

Reset runtime quantization state before the “already patched” early return.

At Line 82, returning immediately skips updating _moe_load_state["mode"] and _moe_load_state["count"]. In multi-load processes, later loads can run with stale mode/count from a prior model load.

🐛 Proposed fix
 def patch_moe_quantization_on_load(cfg):
@@
-    if _moe_load_state["patched"]:
-        LOG.debug("MoE loading-time quantization patch already active")
-        return
-
     import transformers.core_model_loading
     import transformers.modeling_utils
 
     if getattr(cfg, "load_in_8bit", False):
         mode = "8bit"
@@
     _moe_load_state["mode"] = mode
     _moe_load_state["count"] = 0
 
     if mode == "4bit":
         from bitsandbytes.nn.parametrize import replace_parameter_4bit
@@
         _moe_load_state["quant_type"] = quant_type
         _moe_load_state["compress_statistics"] = compress_statistics
+
+    if _moe_load_state["patched"]:
+        LOG.debug(
+            "MoE loading-time quantization patch already active; refreshed runtime state"
+        )
+        return
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    import transformers.core_model_loading
    import transformers.modeling_utils

    if getattr(cfg, "load_in_8bit", False):
        mode = "8bit"
    else:
        mode = "4bit"

    _moe_load_state["mode"] = mode
    _moe_load_state["count"] = 0

    if mode == "4bit":
        from bitsandbytes.nn.parametrize import replace_parameter_4bit

        quant_type = getattr(cfg, "bnb_4bit_quant_type", None) or "nf4"
        compress_statistics = getattr(cfg, "bnb_4bit_use_double_quant", None)
        if compress_statistics is None:
            compress_statistics = True

        _moe_load_state["quant_type"] = quant_type
        _moe_load_state["compress_statistics"] = compress_statistics

    if _moe_load_state["patched"]:
        LOG.debug(
            "MoE loading-time quantization patch already active; refreshed runtime state"
        )
        return
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/monkeypatch/moe_quant.py` around lines 82 - 107, Compute and
write the current runtime quantization state into _moe_load_state before the
early "already patched" return: determine mode from cfg (same logic using
getattr(cfg, "load_in_8bit", False) to select "8bit" or "4bit"), set
_moe_load_state["mode"] = mode and _moe_load_state["count"] = 0 (and if
applicable pre-populate
_moe_load_state["quant_type"]/_moe_load_state["compress_statistics"] when mode
== "4bit") before the if _moe_load_state["patched"] check so later loads don't
see stale state from prior loads.

Comment on lines +632 to +641
    quantize_moe_experts: bool = Field(
        default=False,
        json_schema_extra={
            "description": "Quantize MoE expert weights on load to reduce VRAM. "
            "Requires adapter (lora/qlora) with load_in_4bit or load_in_8bit. "
            "Requires CUDA (not compatible with ROCm or other backends). "
            "Note: total parameter count may be reported incorrectly when enabled "
            "(trainable param count is correct)."
        },
    )
Contributor

⚠️ Potential issue | 🟠 Major

quantize_moe_experts is documented as CUDA-only but backend is not validated.

Line 637 states this is not compatible with ROCm/other backends, but Lines 1305-1313 only validate adapter + bit-loading flags. Unsupported backends can pass config validation and fail later at runtime.

✅ Suggested validator extension
 @model_validator(mode="before")
 @classmethod
 def check_quantize_moe_experts(cls, data):
     if data.get("quantize_moe_experts"):
+        capabilities = data.get("capabilities") or {}
+        compute_capability = str(capabilities.get("compute_capability", ""))
+        if not compute_capability.startswith("sm_"):
+            raise ValueError(
+                "quantize_moe_experts requires CUDA/NVIDIA (compute capability sm_*)."
+            )
         if data.get("adapter") not in ("lora", "qlora"):
             raise ValueError("quantize_moe_experts requires adapter: lora or qlora")
         if not (data.get("load_in_4bit") or data.get("load_in_8bit")):
             raise ValueError(
                 "quantize_moe_experts requires load_in_4bit or load_in_8bit"
             )
     return data

Also applies to: 1305-1313

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/utils/schemas/config.py` around lines 632 - 641, Extend the
existing Pydantic validation that currently checks adapter +
load_in_4bit/load_in_8bit to also enforce that quantize_moe_experts can only be
true when the runtime backend is CUDA (reject when backend is ROCm/other);
specifically, in the same validator that references quantize_moe_experts and the
load_in_4bit/load_in_8bit flags, add a check that the configured backend/device
backend (or runtime torch backend detection) indicates CUDA and raise a
ValidationError if quantize_moe_experts is true but the backend is not CUDA so
the config fails validation early.

@github-actions
Contributor

github-actions Bot commented Feb 27, 2026

📖 Documentation Preview: https://69a573539f289059635108bd--resonant-treacle-0fd729.netlify.app

Deployed on Netlify from commit 3aafa4a

@codecov

codecov Bot commented Feb 27, 2026

@NanoCode012 changed the title from "Fix: v5 moe with qlora / lora and many misc fixes" to "Fix: quantize and target moe layers in transformers v5 for adapters and many misc fixes" on Feb 27, 2026
Comment thread examples/glm4.7-flash/README.md Outdated

- **FSDP VRAM**: FSDP2 may use more VRAM per GPU than single GPU training. We suspect not all layers are properly sharded across ranks.
- **FSDP initial spike**: FSDP LoRA (8-bit) may have a large initial VRAM spike at the first 1-2 steps that then drops. FSDP QLoRA (4-bit) does not exhibit this.
- **cpu_ram_efficient_loading**: Must be set to `false` with FSDP2 — causes `AttributeError: e_score_correction_bias is not an nn.Parameter` due to modeling source.
Collaborator

btw, can we log the stack trace for this and open an issue for fixing in axolotl?

return state_dict


def patch_peft_param_wrapper_for_fsdp2():
Collaborator

we should upstream this

Collaborator Author

I looked at this. This patch was created to resolve an issue caused by our quantize patch, so it may not make sense to upstream it.

@winglian merged commit 945c8ae into main on Mar 3, 2026 (19 of 20 checks passed)
@winglian deleted the fix/v5-moe-qlora branch on March 3, 2026 15:06
@NanoCode012 mentioned this pull request on Mar 4, 2026
@winglian removed the scheduled_release ("This PR is slated for the upcoming release") label on Mar 22, 2026

Development

Successfully merging this pull request may close these issues.

  • Transformers-v5 causes OOM on QLoRA config previously working on 4.57.6 (GLM-4.5-Air on 2x RTX Pro 6000)
  • [ New model ] GLM 4.7 Flash
