[MoE] Undo GPT OSS MoE patch temporarily by Datta0 · Pull Request #470 · unslothai/unsloth-zoo

Datta0 · 2026-02-05T16:34:06Z

Picked the gpt oss code from the place where it works. This loads on colab now

gemini-code-assist · 2026-02-05T16:34:41Z

Summary of Changes

Hello @Datta0, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request implements a temporary rollback and modification of previous patches affecting the GPT OSS Mixture of Experts (MoE) functionality. The primary objective is to resolve loading issues, particularly on Colab, by reverting to a more stable expert implementation and simplifying MXFP4 quantization integration. Additionally, specific LoRA patching for MoE has been temporarily removed, suggesting ongoing work or re-evaluation of its integration.

Highlights

MoE Expert Implementation Reversion: The GptOssExperts class has been reverted from using a custom ParameterModule for 3D weights to a more standard nn.ModuleList of nn.Linear layers, simplifying the expert architecture.
Temporary Removal of MoE LoRA Patching: All specific LoRA patching logic for GPT OSS Mixture of Experts (MoE) has been removed, indicating a temporary rollback of this feature.
Simplified MXFP4 Handling: The logic for MXFP4 quantization patching has been streamlined, making the triton_kernels library a direct requirement for its functionality.
Overflow Prevention for MoE Experts: A temporary fix has been introduced in patch_gpt_oss_linearized to conditionally force float32 for certain intermediate computations within MoE experts, aiming to prevent numerical overflows.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog

unsloth_zoo/temporary_patches/gpt_oss.py
- Reverted the GptOssExperts class to utilize nn.ModuleList for individual expert linear layers, replacing the ParameterModule approach for handling 3D weights.
- Eliminated all code related to LoRA-specific patching for GPT OSS MoE, including patch_gpt_oss_moe_for_lora, forward_mxfp4_gpt_oss_with_lora, and patch_mxfp4_gpt_oss_for_lora.
- Simplified the patch_gpt_oss function by removing conditional checks for triton_kernels availability, making its presence a prerequisite for MXFP4 patching.
- Modified patch_gpt_oss_linearized to introduce a conditional float32 cast for intermediate calculations in MoE experts when UNSLOTH_FORCE_FLOAT32 is enabled, addressing potential overflow issues.
- Applied @torch_compile to the forward method of GptOssTopKRouter and removed redundant property decorators.
- Updated moe_forward_inference and moe_forward_inference_bf16 to be consistent with the new nn.ModuleList-based expert structure.
- Streamlined the GptOssModel's forward method by removing logic for accumulating all_hidden_states and simplifying the return object.
- Performed minor formatting adjustments and removed unnecessary blank lines and import statements throughout the file.

Activity

The pull request was created by Datta0 to address a specific issue where the GPT OSS MoE patch was causing loading failures on Colab.
The changes represent a temporary rollback and modification to restore functionality.
No further activity such as comments or reviews is available in the provided context.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2add2f95c9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-02-05T16:35:59Z

            for decoder_layer in self.layers:
-                if output_hidden_states:
-                    all_hidden_states += (hidden_states,)
-
-                mask = (attention_mask[decoder_layer.attention_type] if isinstance(attention_mask, dict) else attention_mask)
+                mask = attention_mask[decoder_layer.attention_type] if isinstance(attention_mask, dict) else attention_mask
                hidden_states = decoder_layer(
                    hidden_states,
                    attention_mask=mask,


Restore 2D attention_mask expansion in training path

In the training branch you now pass attention_mask straight into each decoder layer (mask = attention_mask[...]) without expanding 2D masks. However, the attention implementation indexed later (attention_mask[:, :, :, : ...]) assumes a 4D causal mask. When callers provide the standard 2D padding mask (batch, seq) during training, this will raise an index error or produce incorrect broadcasting. The removed conversion block previously normalized 2D masks into the expected 4D shape, so training with typical attention_mask inputs will now break.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-02-05T16:35:59Z

        # Fix float16 / float32 mismatching
        hidden_states = hidden_states.to(inputs_embeds.dtype)
-        return process_return(
-            MoeModelOutputWithPast,
-            {
-                "last_hidden_state": hidden_states,
-                "past_key_values": past_key_values,
-                "hidden_states": all_hidden_states,
-            },
-        )
-
-    patch_function(transformers.models.gpt_oss.modeling_gpt_oss.GptOssModel, "forward", forward, match_level="relaxed")
-
-
+        return process_return(MoeModelOutputWithPast, {
+            "last_hidden_state" : hidden_states,
+            "past_key_values" : past_key_values,
+        })


Preserve output_hidden_states in GptOssModel output

The new return path always omits hidden_states and never accumulates them, even when output_hidden_states=True (either via kwargs or config). This silently drops data that downstream callers expect from MoeModelOutputWithPast, changing the model’s output contract and breaking features like diagnostics or adapters that rely on hidden states. The previous logic collected and returned hidden_states, but it is now removed.

Useful? React with 👍 / 👎.

gemini-code-assist

Code Review

This pull request appears to be a significant refactoring and simplification of the GPT-OSS MoE patching logic, likely to temporarily resolve an issue as indicated by the title. Many complex features related to LoRA and MXFP4 have been removed or simplified. While the changes are extensive, I've identified a few areas for improvement: a redundant code block, a critical bug that could lead to a runtime error due to missing properties on a class, a potential numerical stability issue from a commented-out line of code, and a simplified validation check that might be less robust. Addressing these points will improve the robustness of this temporary patch.

I am having trouble creating individual review comments. Click here to see my feedback.

unsloth_zoo/temporary_patches/gpt_oss.py (524)

The weight and bias properties have been removed from GptOssTopKRouter. However, the moe_router_forward function, which is called by moe_forward_inference_bf16, relies on these properties being present on the router instance. This will cause a runtime error. Please consider adding these properties back for compatibility.


    @property
    def weight(self):
        return self.linear.weight

    @property
    def bias(self):
        return self.linear.bias

unsloth_zoo/temporary_patches/gpt_oss.py (836)

This line, which helps prevent overflow in BF16/FP16 by normalizing logits, has been commented out. This might re-introduce numerical stability issues during training with batch sizes greater than 1. If this was unintentional, it should be re-enabled to maintain model stability.

        combined_logits = combined_logits - combined_logits.max(dim=-1, keepdim=True).values

unsloth_zoo/temporary_patches/gpt_oss.py (104-107)

This try-except block appears to be redundant. The Mxfp4HfQuantizer.is_trainable attribute is already set to True within the try block at lines 93-98. This duplicate block can be removed to simplify the code.

unsloth_zoo/temporary_patches/gpt_oss.py (454-462)

The validation for blocks has been simplified, removing checks for blocks.numel() > 0 and blocks.any(). The previous implementation was more robust against processing initialized but unloaded tensors. While this change might be intentional for the temporary fix, it could lead to incorrect behavior if it processes zero-initialized tensors that are not on a 'meta' device. Consider if this simplified check is sufficient for all cases or if the more robust checks should be retained.

use old gpt

2add2f9

chatgpt-codex-connector Bot reviewed Feb 5, 2026

View reviewed changes

gemini-code-assist Bot reviewed Feb 5, 2026

View reviewed changes

danielhanchen merged commit ba585af into unslothai:main Feb 5, 2026

Datta0 mentioned this pull request Feb 16, 2026

Fix MoE wrapper handling for PR 450 #453

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MoE] Undo GPT OSS MoE patch temporarily#470

[MoE] Undo GPT OSS MoE patch temporarily#470
danielhanchen merged 1 commit into
unslothai:mainfrom
Datta0:undo_gpt_oss

Datta0 commented Feb 5, 2026

Uh oh!

gemini-code-assist Bot commented Feb 5, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Feb 5, 2026

Uh oh!

chatgpt-codex-connector Bot Feb 5, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Datta0 commented Feb 5, 2026

Uh oh!

gemini-code-assist Bot commented Feb 5, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

unsloth_zoo/temporary_patches/gpt_oss.py (524)

unsloth_zoo/temporary_patches/gpt_oss.py (836)

unsloth_zoo/temporary_patches/gpt_oss.py (104-107)

unsloth_zoo/temporary_patches/gpt_oss.py (454-462)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants