Skip to content

[MoE] Undo GPT OSS MoE patch temporarily#470

Merged
danielhanchen merged 1 commit into
unslothai:mainfrom
Datta0:undo_gpt_oss
Feb 5, 2026
Merged

[MoE] Undo GPT OSS MoE patch temporarily#470
danielhanchen merged 1 commit into
unslothai:mainfrom
Datta0:undo_gpt_oss

Conversation

@Datta0

@Datta0 Datta0 commented Feb 5, 2026

Copy link
Copy Markdown
Collaborator

Picked the gpt oss code from the place where it works. This loads on colab now

@gemini-code-assist

Copy link
Copy Markdown
Contributor

Summary of Changes

Hello @Datta0, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request implements a temporary rollback and modification of previous patches affecting the GPT OSS Mixture of Experts (MoE) functionality. The primary objective is to resolve loading issues, particularly on Colab, by reverting to a more stable expert implementation and simplifying MXFP4 quantization integration. Additionally, specific LoRA patching for MoE has been temporarily removed, suggesting ongoing work or re-evaluation of its integration.

Highlights

  • MoE Expert Implementation Reversion: The GptOssExperts class has been reverted from using a custom ParameterModule for 3D weights to a more standard nn.ModuleList of nn.Linear layers, simplifying the expert architecture.
  • Temporary Removal of MoE LoRA Patching: All specific LoRA patching logic for GPT OSS Mixture of Experts (MoE) has been removed, indicating a temporary rollback of this feature.
  • Simplified MXFP4 Handling: The logic for MXFP4 quantization patching has been streamlined, making the triton_kernels library a direct requirement for its functionality.
  • Overflow Prevention for MoE Experts: A temporary fix has been introduced in patch_gpt_oss_linearized to conditionally force float32 for certain intermediate computations within MoE experts, aiming to prevent numerical overflows.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • unsloth_zoo/temporary_patches/gpt_oss.py
    • Reverted the GptOssExperts class to utilize nn.ModuleList for individual expert linear layers, replacing the ParameterModule approach for handling 3D weights.
    • Eliminated all code related to LoRA-specific patching for GPT OSS MoE, including patch_gpt_oss_moe_for_lora, forward_mxfp4_gpt_oss_with_lora, and patch_mxfp4_gpt_oss_for_lora.
    • Simplified the patch_gpt_oss function by removing conditional checks for triton_kernels availability, making its presence a prerequisite for MXFP4 patching.
    • Modified patch_gpt_oss_linearized to introduce a conditional float32 cast for intermediate calculations in MoE experts when UNSLOTH_FORCE_FLOAT32 is enabled, addressing potential overflow issues.
    • Applied @torch_compile to the forward method of GptOssTopKRouter and removed redundant property decorators.
    • Updated moe_forward_inference and moe_forward_inference_bf16 to be consistent with the new nn.ModuleList-based expert structure.
    • Streamlined the GptOssModel's forward method by removing logic for accumulating all_hidden_states and simplifying the return object.
    • Performed minor formatting adjustments and removed unnecessary blank lines and import statements throughout the file.
Activity
  • The pull request was created by Datta0 to address a specific issue where the GPT OSS MoE patch was causing loading failures on Colab.
  • The changes represent a temporary rollback and modification to restore functionality.
  • No further activity such as comments or reviews is available in the provided context.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2add2f95c9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 1248 to 1252
for decoder_layer in self.layers:
if output_hidden_states:
all_hidden_states += (hidden_states,)

mask = (attention_mask[decoder_layer.attention_type] if isinstance(attention_mask, dict) else attention_mask)
mask = attention_mask[decoder_layer.attention_type] if isinstance(attention_mask, dict) else attention_mask
hidden_states = decoder_layer(
hidden_states,
attention_mask=mask,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Restore 2D attention_mask expansion in training path

In the training branch you now pass attention_mask straight into each decoder layer (mask = attention_mask[...]) without expanding 2D masks. However, the attention implementation indexed later (attention_mask[:, :, :, : ...]) assumes a 4D causal mask. When callers provide the standard 2D padding mask (batch, seq) during training, this will raise an index error or produce incorrect broadcasting. The removed conversion block previously normalized 2D masks into the expected 4D shape, so training with typical attention_mask inputs will now break.

Useful? React with 👍 / 👎.

Comment on lines 1262 to +1267
# Fix float16 / float32 mismatching
hidden_states = hidden_states.to(inputs_embeds.dtype)
return process_return(
MoeModelOutputWithPast,
{
"last_hidden_state": hidden_states,
"past_key_values": past_key_values,
"hidden_states": all_hidden_states,
},
)

patch_function(transformers.models.gpt_oss.modeling_gpt_oss.GptOssModel, "forward", forward, match_level="relaxed")


return process_return(MoeModelOutputWithPast, {
"last_hidden_state" : hidden_states,
"past_key_values" : past_key_values,
})

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve output_hidden_states in GptOssModel output

The new return path always omits hidden_states and never accumulates them, even when output_hidden_states=True (either via kwargs or config). This silently drops data that downstream callers expect from MoeModelOutputWithPast, changing the model’s output contract and breaking features like diagnostics or adapters that rely on hidden states. The previous logic collected and returned hidden_states, but it is now removed.

Useful? React with 👍 / 👎.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request appears to be a significant refactoring and simplification of the GPT-OSS MoE patching logic, likely to temporarily resolve an issue as indicated by the title. Many complex features related to LoRA and MXFP4 have been removed or simplified. While the changes are extensive, I've identified a few areas for improvement: a redundant code block, a critical bug that could lead to a runtime error due to missing properties on a class, a potential numerical stability issue from a commented-out line of code, and a simplified validation check that might be less robust. Addressing these points will improve the robustness of this temporary patch.

I am having trouble creating individual review comments. Click here to see my feedback.

unsloth_zoo/temporary_patches/gpt_oss.py (524)

critical

The weight and bias properties have been removed from GptOssTopKRouter. However, the moe_router_forward function, which is called by moe_forward_inference_bf16, relies on these properties being present on the router instance. This will cause a runtime error. Please consider adding these properties back for compatibility.


    @property
    def weight(self):
        return self.linear.weight

    @property
    def bias(self):
        return self.linear.bias

unsloth_zoo/temporary_patches/gpt_oss.py (836)

high

This line, which helps prevent overflow in BF16/FP16 by normalizing logits, has been commented out. This might re-introduce numerical stability issues during training with batch sizes greater than 1. If this was unintentional, it should be re-enabled to maintain model stability.

        combined_logits = combined_logits - combined_logits.max(dim=-1, keepdim=True).values

unsloth_zoo/temporary_patches/gpt_oss.py (104-107)

medium

This try-except block appears to be redundant. The Mxfp4HfQuantizer.is_trainable attribute is already set to True within the try block at lines 93-98. This duplicate block can be removed to simplify the code.

unsloth_zoo/temporary_patches/gpt_oss.py (454-462)

medium

The validation for blocks has been simplified, removing checks for blocks.numel() > 0 and blocks.any(). The previous implementation was more robust against processing initialized but unloaded tensors. While this change might be intentional for the temporary fix, it could lead to incorrect behavior if it processes zero-initialized tensors that are not on a 'meta' device. Consider if this simplified check is sufficient for all cases or if the more robust checks should be retained.

@danielhanchen danielhanchen merged commit ba585af into unslothai:main Feb 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants