Skip to content

tests: CPU regression detectors for the MoE merge / save path (#5410)#649

Merged
danielhanchen merged 1 commit into
fix-5410-moe-merge-layoutfrom
ci-pr5410-regression-detectors
May 15, 2026
Merged

tests: CPU regression detectors for the MoE merge / save path (#5410)#649
danielhanchen merged 1 commit into
fix-5410-moe-merge-layoutfrom
ci-pr5410-regression-detectors

Conversation

@danielhanchen

Copy link
Copy Markdown
Member

Summary

Adds four CPU-only signals so the upstream-drift matrix catches the class of bug that produced unslothai/unsloth#5410 before it ships. Stacked on top of #647: the two _MOE_MERGE_STATE / _detect_moe_lora_layout symbols only exist after #647 lands. Base will retarget to main once #647 merges.

Why CI did not catch the original bug

unsloth-zoo/.github/workflows/consolidated-tests-ci.yml already runs a three-cell drift matrix:

cell transformers peft
HF=4.57.6 + TRL<1 transformers==4.57.6 peft>=0.18,<0.20
HF=latest + TRL=latest transformers>=5,<6 peft (latest)
HF=default + TRL=default from pyproject from pyproject

This is the right matrix to catch #5410. But the pytest invocation only listed 12 drift-detector files; tests/test_unsloth_zoo_lora_merge.py (the only file that actually covers the merge helpers) was never run inside the matrix. So peft 0.19 could land in the matrix without exercising the helpers that broke.

What this PR adds

1. Re-route the merge battery through the matrix. One-line workflow change at line ~234 of consolidated-tests-ci.yml adds tests/test_unsloth_zoo_lora_merge.py (already 22 cases covering both PEFT layouts, the fallback counter, and the resolver walk, after #647) plus the three new files below to the matrix pytest list.

2. tests/test_peft_paramwrapper_layout_drift.py (45 lines). Builds a 4-line nn.Module with a single fused 3D nn.Parameter, attaches a real peft LoRA via target_parameters=[\"gate_up_proj\"], then asserts the resulting lora_A / lora_B shapes match either the swapped or standard signature. A third layout convention from a future PEFT release fires this test on the day the new pin lands in the matrix, with a message naming both the observed and expected shapes.

3. tests/test_transformers_moe_structure_drift.py (~120 lines). Imports each tracked MoE block class (Qwen3MoeSparseMoeBlock, MixtralSparseMoeBlock) from the installed transformers, instantiates a tiny config, and pins whether the experts container is a fused 3D nn.Parameter (fused_3d) or per-expert nn.ModuleList (module_list). The same kind of refactor that flipped Qwen3MoE from ModuleList to fused 3D between 4.57.x and 5.x will fire this test on the next refactor.

4. tests/test_moe_merge_e2e_cpu.py (~180 lines, 6 parametrized cases). Drives the per-expert merge helpers in the same order the inner shard loop does, for a synthetic 2-layer / 4-expert state dict, in both PEFT layouts. Asserts:

  • All 3 * E * L tensors receive the analytic delta (sub-1e-4 max error).
  • _MOE_MERGE_STATE[\"applied\"] matches the expected apply count and \"fallback\" is 0.
  • Unrecognised layout shapes record a fallback and populate first_error.
  • The resolver walks past an outer ParamWrapper with .module = None to read num_experts off the inner wrapper (this is the exact regression that produced 935 silent fallbacks before the resolver fix in saving: layout-aware MoE LoRA merge + loud-fail on fallback (#5410) #647).
  • A module.base_layer is module self-cycle terminates within the 16-hop bound.
  • _detect_moe_lora_layout returns \"unknown\" for a shard-local num_experts=17 against total_rank=128*r, so the per-expert helper records a fallback rather than producing a silently-wrong slice.

Local results

33 passed in 0.33s

Also confirmed that on pre-#647 main, test_moe_merge_e2e_cpu.py fails to import (ImportError: cannot import name '_MOE_MERGE_STATE'), proving the detector triggers if any future PR weakens or removes the loud-fail guard.

Cost

All four files are pure CPU, no model download, no GPU. Per matrix cell: ~33 extra tests in well under 2 seconds. Three cells, so under 6 seconds added to the total CI wall time.

Test plan

  • pytest tests/test_unsloth_zoo_lora_merge.py tests/test_peft_paramwrapper_layout_drift.py tests/test_transformers_moe_structure_drift.py tests/test_moe_merge_e2e_cpu.py -> 33 / 33 PASS
  • Reverted unsloth_zoo/saving_utils.py to pre-saving: layout-aware MoE LoRA merge + loud-fail on fallback (#5410) #647 main locally; confirmed the e2e file fails to import and the detector class as a whole correctly goes red
  • CI matrix on this PR exercises the four files across the three combos

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 52896d4ac6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +97 to +99
peft_model = get_peft_model(base, cfg)
except Exception as e:
pytest.skip(f"PEFT failed to wrap fused 3D param on this build: {e}")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Fail when PEFT can no longer wrap fused params

Because this file is now part of the hard upstream-regression matrix, skipping every get_peft_model exception masks exactly the PEFT drift this detector is supposed to catch: once target_parameters is accepted, a future PEFT release that changes or breaks fused-3D ParamWrapper creation will make the test report skipped instead of failing. The earlier LoraConfig(... target_parameters=...) probe already handles unsupported pre-0.18 builds, so wrapping failures in supported matrix cells should be assertions/failures rather than skips.

Useful? React with 👍 / 👎.

Comment on lines +132 to +133
accepted = {"fused_3d", "module_list"}
assert kind in accepted, (

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Pin the expected MoE container kind

This accepts both fused_3d and module_list, so a transformers update that flips a tracked architecture between those two layouts remains green even though the test description says that exact flip should be caught. Since the workflow now relies on this as a drift detector, the expected kind needs to be version/architecture-specific (or otherwise asserted) rather than allowing both supported layouts unconditionally.

Useful? React with 👍 / 👎.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces three new test suites designed to validate Mixture of Experts (MoE) merging logic and detect structural drifts in external dependencies. The added tests cover end-to-end CPU regression for LoRA delta applications, PEFT ParamWrapper layout changes, and Transformers MoE expert container refactors. Feedback suggests moving the manual seed initialization out of helper functions to increase test data diversity and utilizing pytest fixtures for more reliable global state management.

def _build_synthetic_layer(num_experts: int, rank_per: int, hidden: int,
intermediate: int, layout: str, alpha: float, dtype=torch.float32):
"""Make: fused base weights + per-expert disk shards + LoRA tensors."""
torch.manual_seed(SEED)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Calling torch.manual_seed(SEED) inside _build_synthetic_layer resets the random number generator state every time a layer is built. In test_per_layer_merge_round_trip, this results in all layers having identical weights and LoRA matrices, which reduces the diversity of the test data. It is recommended to set the seed once at the start of the test function instead so that subsequent calls to this helper generate different random values.

Suggested change
torch.manual_seed(SEED)
# torch.manual_seed(SEED)

alpha = 8.0
dtype = torch.float32

_reset_moe_merge_state()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The test manually resets the global _MOE_MERGE_STATE at the beginning and end of the function. To ensure the state is always cleaned up (even if an assertion fails) and to follow pytest best practices, consider using a fixture with autouse=True for state management. This improves test isolation and reliability, especially if tests are ever run in parallel within the same process.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ec86efdefb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +80 to +83
try:
mod = __import__(module_path, fromlist=[block_cls_name, cfg_cls_name])
except Exception as e:
pytest.skip(f"{module_path} not importable: {e}")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Fail when a tracked MoE module stops importing

When a tracked architecture's import path breaks or is renamed while another tracked module still imports, this skips the broken parametrized case and test_at_least_one_tracked_moe_arch_imports still passes because it only requires one successful import. In the matrix you just added this file to, that means a Qwen3/Mixtral upstream drift can disappear silently instead of failing the canary; treat failures for expected tracked entries as assertions or gate skips by known unsupported transformer versions.

Useful? React with 👍 / 👎.

@danielhanchen danielhanchen force-pushed the fix-5410-moe-merge-layout branch from b0112e5 to 6b0f75e Compare May 15, 2026 07:17
Adds four CPU-only signals so the upstream drift matrix catches the
class of bug that produced unslothai/unsloth#5410 before it ships:

1. Add tests/test_unsloth_zoo_lora_merge.py to the drift-matrix pytest
   invocation (consolidated-tests-ci.yml). The 22 helper-level merge
   tests already cover both PEFT 0.18 swapped and PEFT 0.19+ standard
   layouts, the fallback counter, and the resolver walk; they were
   only ever exercised in the default cell. Now they fire across all
   three matrix combos (HF 4.57.6 + TRL<1, HF latest + TRL latest,
   pyproject defaults).

2. tests/test_peft_paramwrapper_layout_drift.py - constructs a 4-line
   nn.Module with a single fused 3D nn.Parameter, attaches a real
   peft LoRA via target_parameters, and asserts the lora_A / lora_B
   shapes match either the swapped or standard signature. A third
   layout convention from a future PEFT release fires this test on
   the day the new pin lands in the matrix.

3. tests/test_transformers_moe_structure_drift.py - imports each
   tracked MoE block class (Qwen3MoeSparseMoeBlock, MixtralSparseMoeBlock)
   from the installed transformers, instantiates a tiny config, and
   pins whether the experts container is fused 3D nn.Parameter or
   per-expert nn.ModuleList. The same kind of refactor that flipped
   Qwen3MoE from ModuleList to fused 3D between 4.57.x and 5.x will
   fire this test on the next refactor.

4. tests/test_moe_merge_e2e_cpu.py - drives the per-expert merge
   helpers in the same order the inner shard loop does, for a
   synthetic 2-layer / 4-expert state dict, in both PEFT layouts.
   Asserts all 3 * E * L tensors receive the analytic delta, the
   _MOE_MERGE_STATE counter records each apply, fallbacks land on
   unrecognised shapes, the resolver walks the outer ParamWrapper
   chain, and a self-cycle in `module.base_layer` terminates.

All four files are pure CPU, sub-second, no model download. Total
matrix overhead: <2s per cell, three cells, 33 tests.

Verified locally (33 / 33 PASS) and confirmed that the e2e file fails
to import on pre-#5410 main (ImportError: _MOE_MERGE_STATE), proving
the detector triggers if any future PR weakens the loud-fail guard.
@danielhanchen danielhanchen force-pushed the ci-pr5410-regression-detectors branch from ec86efd to c5c7a86 Compare May 15, 2026 07:20
@danielhanchen danielhanchen merged commit 0da895c into fix-5410-moe-merge-layout May 15, 2026
7 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant