[MoE] Undo GPT OSS MoE patch temporarily#470
Conversation
Summary of ChangesHello @Datta0, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request implements a temporary rollback and modification of previous patches affecting the GPT OSS Mixture of Experts (MoE) functionality. The primary objective is to resolve loading issues, particularly on Colab, by reverting to a more stable expert implementation and simplifying MXFP4 quantization integration. Additionally, specific LoRA patching for MoE has been temporarily removed, suggesting ongoing work or re-evaluation of its integration. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2add2f95c9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| for decoder_layer in self.layers: | ||
| if output_hidden_states: | ||
| all_hidden_states += (hidden_states,) | ||
|
|
||
| mask = (attention_mask[decoder_layer.attention_type] if isinstance(attention_mask, dict) else attention_mask) | ||
| mask = attention_mask[decoder_layer.attention_type] if isinstance(attention_mask, dict) else attention_mask | ||
| hidden_states = decoder_layer( | ||
| hidden_states, | ||
| attention_mask=mask, |
There was a problem hiding this comment.
Restore 2D attention_mask expansion in training path
In the training branch you now pass attention_mask straight into each decoder layer (mask = attention_mask[...]) without expanding 2D masks. However, the attention implementation indexed later (attention_mask[:, :, :, : ...]) assumes a 4D causal mask. When callers provide the standard 2D padding mask (batch, seq) during training, this will raise an index error or produce incorrect broadcasting. The removed conversion block previously normalized 2D masks into the expected 4D shape, so training with typical attention_mask inputs will now break.
Useful? React with 👍 / 👎.
| # Fix float16 / float32 mismatching | ||
| hidden_states = hidden_states.to(inputs_embeds.dtype) | ||
| return process_return( | ||
| MoeModelOutputWithPast, | ||
| { | ||
| "last_hidden_state": hidden_states, | ||
| "past_key_values": past_key_values, | ||
| "hidden_states": all_hidden_states, | ||
| }, | ||
| ) | ||
|
|
||
| patch_function(transformers.models.gpt_oss.modeling_gpt_oss.GptOssModel, "forward", forward, match_level="relaxed") | ||
|
|
||
|
|
||
| return process_return(MoeModelOutputWithPast, { | ||
| "last_hidden_state" : hidden_states, | ||
| "past_key_values" : past_key_values, | ||
| }) |
There was a problem hiding this comment.
Preserve output_hidden_states in GptOssModel output
The new return path always omits hidden_states and never accumulates them, even when output_hidden_states=True (either via kwargs or config). This silently drops data that downstream callers expect from MoeModelOutputWithPast, changing the model’s output contract and breaking features like diagnostics or adapters that rely on hidden states. The previous logic collected and returned hidden_states, but it is now removed.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request appears to be a significant refactoring and simplification of the GPT-OSS MoE patching logic, likely to temporarily resolve an issue as indicated by the title. Many complex features related to LoRA and MXFP4 have been removed or simplified. While the changes are extensive, I've identified a few areas for improvement: a redundant code block, a critical bug that could lead to a runtime error due to missing properties on a class, a potential numerical stability issue from a commented-out line of code, and a simplified validation check that might be less robust. Addressing these points will improve the robustness of this temporary patch.
I am having trouble creating individual review comments. Click here to see my feedback.
unsloth_zoo/temporary_patches/gpt_oss.py (524)
The weight and bias properties have been removed from GptOssTopKRouter. However, the moe_router_forward function, which is called by moe_forward_inference_bf16, relies on these properties being present on the router instance. This will cause a runtime error. Please consider adding these properties back for compatibility.
@property
def weight(self):
return self.linear.weight
@property
def bias(self):
return self.linear.bias
unsloth_zoo/temporary_patches/gpt_oss.py (836)
This line, which helps prevent overflow in BF16/FP16 by normalizing logits, has been commented out. This might re-introduce numerical stability issues during training with batch sizes greater than 1. If this was unintentional, it should be re-enabled to maintain model stability.
combined_logits = combined_logits - combined_logits.max(dim=-1, keepdim=True).values
unsloth_zoo/temporary_patches/gpt_oss.py (104-107)
This try-except block appears to be redundant. The Mxfp4HfQuantizer.is_trainable attribute is already set to True within the try block at lines 93-98. This duplicate block can be removed to simplify the code.
unsloth_zoo/temporary_patches/gpt_oss.py (454-462)
The validation for blocks has been simplified, removing checks for blocks.numel() > 0 and blocks.any(). The previous implementation was more robust against processing initialized but unloaded tensors. While this change might be intentional for the temporary fix, it could lead to incorrect behavior if it processes zero-initialized tensors that are not on a 'meta' device. Consider if this simplified check is sufficient for all cases or if the more robust checks should be retained.
Picked the gpt oss code from the place where it works. This loads on colab now