[WIP][MoE] Gpt oss moe kernels by Datta0 · Pull Request #447 · unslothai/unsloth-zoo

Datta0 · 2026-01-26T09:49:28Z

Please take a look at #396 first and then this

gemini-code-assist · 2026-01-26T09:50:04Z

Summary of Changes

Hello @Datta0, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request optimizes Mixture of Experts (MoE) models within the Unsloth framework. It adds 4-bit quantization for MoE layers, uses grouped GEMM kernels for faster forward passes, and refines the handling of LoRA adapters for MoE architectures. The changes also include model-specific optimizations for Qwen3-VL-MoE, give more control over MXFP4 quantization, and add support for Group Relative Policy Optimization (GRPO) training. Together these changes aim to improve speed and reduce memory consumption for MoE models.

Highlights

MoE 4-bit Quantization: Introduced bitsandbytes-style 4-bit quantization for Mixture of Experts (MoE) layers, enabling on-the-fly quantization and seamless integration with model loading processes. This addresses the challenge of quantizing MoE layers that use nn.Parameter tensors instead of standard nn.Linear modules.
Optimized MoE Backends: Implemented high-performance MoE forward passes utilizing torch._grouped_mm (native PyTorch) and Unsloth's Triton grouped GEMM kernels. The system now dynamically selects the most efficient backend based on availability and environment variables, significantly accelerating inference and training.
Enhanced LoRA for MoE: Improved LoRA support for MoE layers by patching PEFT's ParamWrapper. This allows for handling separated LoRA weights, with specific logic to correctly process both standard and transposed weight formats found in different MoE architectures (e.g., Qwen3-VL-MoE).
Qwen3-VL-MoE Specific Optimizations: Added specialized handling for Qwen3-VL-MoE models, including patching the __init__ method of Qwen3VLMoeTextExperts to initialize expert weights in a grouped_mm compatible transposed format. This ensures efficient loading and computation for these specific visual language models.
MXFP4 Quantization Flexibility: Introduced new configuration options for MXFP4 quantization. Users can now choose to keep MXFP4 weights quantized (without dequantization to bf16) if triton_kernels is available, offering improved memory efficiency and performance for compatible hardware.
GRPO Training Support: Patched the forward methods of Qwen3MoeForCausalLM and Qwen3VLMoeForConditionalGeneration to optionally return hidden states instead of logits. This is needed by Group Relative Policy Optimization (GRPO) training workflows.
vLLM Integration Improvements: Made minor enhancements to vLLM integration, including a more robust import mechanism for vllm_version and the addition of an explicit lora_request_id parameter to the load_lora function for better management of LoRA requests.
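
The backend selection described under "Optimized MoE Backends" can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the environment variable name `EXAMPLE_MOE_BACKEND` and the preference order are assumptions, and the backend labels only mirror the forward functions named in this PR (`forward_native_grouped_mm`, `forward_triton_grouped_gemm`, `forward_native_moe_loop`).

```python
import os

# Labels mirror the forward functions named in the PR; the order is an assumption.
BACKENDS = ("grouped_mm", "triton", "native_loop")


def select_moe_backend(grouped_mm_ok, triton_ok, env=None):
    """Pick a backend; an explicit override wins only if that backend is usable."""
    env = os.environ if env is None else env
    available = {
        "grouped_mm": grouped_mm_ok,
        "triton": triton_ok,
        "native_loop": True,  # the plain per-expert loop is always usable
    }
    # HYPOTHETICAL variable name, used only for this illustration.
    override = env.get("EXAMPLE_MOE_BACKEND", "").strip().lower()
    if available.get(override, False):
        return override
    # No usable override: fall through in preference order.
    for name in BACKENDS:
        if available[name]:
            return name
```

Ignoring an override that names an unavailable backend keeps a stale environment setting from breaking the forward pass; raising an error instead would be an equally defensible choice.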

gemini-code-assist

Code Review

This pull request introduces significant enhancements for Mixture of Experts (MoE) models, particularly for Qwen3 and Qwen3-VL architectures. Key changes include the implementation of 4-bit quantization for MoE layers using bitsandbytes, integration of grouped GEMM kernels (both native PyTorch and Triton) for optimized forward passes, and a robust patching mechanism for PEFT's ParamWrapper to support separated LoRA for MoE. Additionally, the changes improve vLLM integration by adding vllm_config propagation and refining vllm_version detection. The introduction of MXFP4 configuration for training, including conditional dequantization based on triton_kernels availability, further enhances the quantization capabilities. Overall, these changes aim to improve performance, memory efficiency, and compatibility for MoE models within the Unsloth ecosystem.

gemini-code-assist · 2026-01-26T09:52:35Z

+            gate_up_weight = gate_up_proj[expert_idx].data.clone()  # [2*I, H]
+            down_weight = down_proj[expert_idx].data.clone()  # [H, I]


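Accessing .data is discouraged in PyTorch: it returns a tensor that shares storage with the original but is hidden from autograd, so later in-place changes are not tracked. It's recommended to use .detach().clone() instead to avoid potential issues with autograd and ensure proper tensor handling.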

Suggested change

-            gate_up_weight = gate_up_proj[expert_idx].data.clone()  # [2*I, H]
-            down_weight = down_proj[expert_idx].data.clone()  # [H, I]
+            gate_up_weight = gate_up_proj[expert_idx].detach().clone()  # [2*I, H]
+            down_weight = down_proj[expert_idx].detach().clone()  # [H, I]

gemini-code-assist · 2026-01-26T09:52:35Z

+            os.makedirs(UNSLOTH_COMPILE_LOCATION)
+        except:
+            pass
+


The bare except: clause is too broad: it catches every exception, including KeyboardInterrupt and SystemExit. For file operations it's generally better to catch OSError (IOError is an alias of OSError in Python 3). This helps in debugging and understanding the root cause of failures.

Suggested change

-            os.makedirs(UNSLOTH_COMPILE_LOCATION)
-        except:
-            pass
+        try:
+            os.makedirs(UNSLOTH_COMPILE_LOCATION)
+        except OSError:
+            pass
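
One possible alternative, sketched under the assumption that the only goal is to tolerate an already existing cache directory: `os.makedirs(..., exist_ok=True)` handles that case without a handler, and a narrow `except OSError` then reports real failures. The helper name `ensure_dir` is hypothetical and does not appear in the PR.

```python
import os
import tempfile


def ensure_dir(path):
    """Create path if needed; return True when it exists as a directory."""
    try:
        # exist_ok=True makes an existing directory a no-op instead of an error.
        os.makedirs(path, exist_ok=True)
    except OSError:
        # Permission problems, read-only filesystems, or a regular file in the way.
        return False
    return os.path.isdir(path)


# Usage: the second call is a no-op, and a file in the way is reported, not hidden.
with tempfile.TemporaryDirectory() as root:
    cache = os.path.join(root, "compiled_cache")
    first = ensure_dir(cache)
    second = ensure_dir(cache)
    blocker = os.path.join(root, "not_a_dir")
    open(blocker, "w").close()
    blocked = ensure_dir(blocker)
```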

gemini-code-assist · 2026-01-26T09:52:35Z

+            pass
+
+
+install_to_cache(__file__, "moe_utils.py")


Executing install_to_cache(__file__, "moe_utils.py") directly at the module level means this code runs every time moe_utils.py is imported. While this might be intended for setup, it can lead to unexpected side effects or performance overhead in certain scenarios (e.g., repeated imports, testing environments). Consider wrapping this call in a function or a conditional block (e.g., if __name__ == "__main__":) if it's meant for a specific initialization step, or add a comment explaining why it's necessary to run at import time.
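
If the call has to stay at import time, one way to limit the overhead is to make it idempotent. The sketch below assumes that `install_to_cache` copies a source file into a cache directory, which this excerpt does not show. `install_if_changed` is a hypothetical stand-in, not the PR's function.

```python
import filecmp
import os
import shutil
import tempfile


def install_if_changed(src, dst_dir, name):
    """Copy src to dst_dir/name only when missing or different; return True if copied."""
    os.makedirs(dst_dir, exist_ok=True)
    dst = os.path.join(dst_dir, name)
    # shallow=False compares file contents rather than only size and mtime.
    if os.path.exists(dst) and filecmp.cmp(src, dst, shallow=False):
        return False
    shutil.copyfile(src, dst)
    return True


# Usage: a repeated import would hit the cheap comparison path, not a second copy.
with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "moe_utils.py")
    with open(src, "w") as f:
        f.write("x = 1\n")
    cache = os.path.join(root, "cache")
    copied_first = install_if_changed(src, cache, "moe_utils.py")
    copied_again = install_if_changed(src, cache, "moe_utils.py")
```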

gemini-code-assist · 2026-01-26T09:52:35Z

+    except Exception:
+        _TORCH_GROUPED_MM_SUPPORTED = False


The try...except Exception block is too broad. For a runtime check like this, it would be more precise to catch specific exceptions that torch._grouped_mm might raise, such as RuntimeError or AttributeError, to avoid masking other potential issues.

Suggested change

-    except Exception:
-        _TORCH_GROUPED_MM_SUPPORTED = False
+    except (RuntimeError, AttributeError):
+        _TORCH_GROUPED_MM_SUPPORTED = False
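
The difference between the two handlers can be shown with a small library-independent sketch. `probe` is a hypothetical helper, and the callables stand in for a trial call to `torch._grouped_mm`. Expected failures turn into `False`, while anything else still propagates.

```python
def probe(op, expected=(RuntimeError, AttributeError)):
    """Return True if op() runs, False if it fails in one of the expected ways."""
    try:
        op()
    except expected:
        return False
    return True


def unsupported_op():
    # Stand-in for a kernel that rejects the current device or dtype.
    raise RuntimeError("not supported on this device")


def buggy_probe():
    # Stand-in for a bug in the probing code itself.
    return 1 / 0


supported = probe(lambda: None)
rejected = probe(unsupported_op)

try:
    probe(buggy_probe)
    bug_was_masked = True
except ZeroDivisionError:
    bug_was_masked = False  # the narrow handler lets the real bug surface
```

With `except Exception`, the last case would silently set the support flag to `False` and hide the bug.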

gemini-code-assist · 2026-01-26T09:52:35Z

+    except Exception:
+        pass


The try...except Exception block is too broad. When dealing with external libraries like Triton, it's usually more appropriate to catch ImportError or ModuleNotFoundError if the issue is related to the library's availability, or specific runtime errors if the issue is with its usage.

Suggested change

-    except Exception:
-        pass
+    except ImportError:
+        pass
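
A minimal sketch of the optional-dependency pattern this suggestion points at, using only the standard library. `optional_import` is a hypothetical helper name. Catching `ImportError` also covers `ModuleNotFoundError`, which is its subclass.

```python
import importlib


def optional_import(name):
    """Return the imported module, or None when it cannot be imported."""
    try:
        return importlib.import_module(name)
    except ImportError:  # also covers ModuleNotFoundError
        return None


json_mod = optional_import("json")
missing_mod = optional_import("surely_not_an_installed_module_xyz")
```

Note that a package such as Triton can also fail at import time for reasons other than being absent (for example a driver problem), so whether `ImportError` alone is enough depends on which failures the caller wants to tolerate.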

gemini-code-assist · 2026-01-26T09:52:35Z

+    except Exception:
+        return None


The try...except Exception block is too broad. When extracting LoRA weights, it's better to catch specific exceptions that might occur during attribute access or tensor manipulation, such as AttributeError, KeyError, or IndexError, to provide more targeted error handling and debugging.

gemini-code-assist · 2026-01-26T09:52:36Z

+            forward_native_grouped_mm = moe_utils.forward_native_grouped_mm
+            forward_triton_grouped_gemm = moe_utils.forward_triton_grouped_gemm
+            forward_native_moe_loop = moe_utils.forward_native_moe_loop


Importing forward_native_grouped_mm, forward_triton_grouped_gemm, and forward_native_moe_loop locally within the old_forward function is unusual. While it works, it can make the code harder to read and potentially lead to subtle scoping issues or unexpected behavior if the module structure changes. It's generally better practice to place imports at the top of the module or function scope where they are first needed, rather than inside a dynamically patched function. Consider moving these imports to the module level if they are consistently used across different patched functions.

gemini-code-assist · 2026-01-26T09:52:36Z

+            forward_native_grouped_mm = moe_utils.forward_native_grouped_mm
+            forward_triton_grouped_gemm = moe_utils.forward_triton_grouped_gemm
+            forward_native_moe_loop = moe_utils.forward_native_moe_loop


Similar to the old_forward function, importing forward_native_grouped_mm, forward_triton_grouped_gemm, and forward_native_moe_loop locally within this forward function is not ideal. It's generally better practice to place imports at the top of the module or function scope where they are first needed. For dynamically patched functions, this pattern might be a workaround, but it could impact readability and maintainability. If these functions are meant to be globally available to the patched methods, module-level imports would be clearer.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 60350d5958

chatgpt-codex-connector · 2026-01-26T09:55:34Z

+            # Get dimensions from the original module
+            gate_up_proj = experts.gate_up_proj
+            num_experts = gate_up_proj.shape[0]
+            intermediate_dim = gate_up_proj.shape[1] // 2
+            hidden_dim = gate_up_proj.shape[2]


Fix MoE 4-bit dim inference for Qwen3-VL layout

The quantized module infers intermediate_dim and hidden_dim from gate_up_proj.shape[1]/[2], which assumes the standard (E, 2I, H) layout. This same commit patches Qwen3‑VL MoE to store weights in grouped_mm format (E, H, 2I) in qwen3_vl_moe.py (lines 189–231), so for Qwen3‑VL this calculation swaps dimensions (intermediate_dim becomes H/2 and hidden_dim becomes 2*I). The result is a quantized module with wrong shapes, which will fail to load weights or produce invalid outputs. Consider detecting the transposed layout or using config.hidden_size/moe_intermediate_size instead of raw tensor dims.
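
A rough sketch of the suggested fix, assuming the config exposes `hidden_size` and `moe_intermediate_size` as named above. The function `infer_moe_dims` is hypothetical. It takes the dimensions from the config and uses the tensor shape only to decide which layout is present.

```python
def infer_moe_dims(shape, hidden_size, intermediate_size):
    """Return (num_experts, intermediate_dim, hidden_dim, layout) for gate_up_proj."""
    num_experts, d1, d2 = shape
    standard = (d1, d2) == (2 * intermediate_size, hidden_size)    # (E, 2I, H)
    transposed = (d1, d2) == (hidden_size, 2 * intermediate_size)  # (E, H, 2I)
    if standard and transposed:
        # Happens when 2I == H: the shape alone cannot distinguish the layouts.
        raise ValueError("ambiguous layout; an explicit flag is needed")
    if not (standard or transposed):
        raise ValueError(f"shape {shape} matches neither (E, 2I, H) nor (E, H, 2I)")
    layout = "standard" if standard else "transposed"
    return num_experts, intermediate_size, hidden_size, layout
```

For example, with `hidden_size=2048` and `intermediate_size=768`, a shape of `(8, 2048, 1536)` is reported as the transposed layout, whereas reading `shape[1] // 2` and `shape[2]` directly would yield an intermediate size of 1024 and a hidden size of 1536.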

chatgpt-codex-connector · 2026-01-26T09:55:34Z

+            # Save weight data
+            gate_up_prefix = f"{prefix}gate_up_projs.{expert_idx}."
+            down_prefix = f"{prefix}down_projs.{expert_idx}."
+
+            destination[f"{gate_up_prefix}weight"] = (


Make 4-bit MoE save/load key scheme consistent

The custom _save_to_state_dict writes weights under gate_up_projs.*/down_projs.*, but _load_from_state_dict only recognizes the stacked gate_up_proj/down_proj keys and otherwise defers to the default loader (which expects _bnb_gate_up_weights.*). A checkpoint saved by this module will therefore reload with missing quantized weights (keys become unexpected and the ParameterLists stay empty), breaking inference on reload. The save/load key names need to align so round‑trip works.
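
One way to keep the two sides aligned is to build the keys in a single helper that both save and load call. The sketch below uses plain dicts and lists in place of a real state dict and quantized tensors. The key pattern `gate_up_projs.{i}.weight` follows the diff above, while the function names are hypothetical.

```python
def expert_key(prefix, proj, expert_idx, field="weight"):
    """Single source of truth for per-expert state-dict keys."""
    return f"{prefix}{proj}.{expert_idx}.{field}"


def save_experts(gate_up, down, prefix=""):
    """Write one entry per expert and projection."""
    dest = {}
    for i, (gu, dn) in enumerate(zip(gate_up, down)):
        dest[expert_key(prefix, "gate_up_projs", i)] = gu
        dest[expert_key(prefix, "down_projs", i)] = dn
    return dest


def load_experts(state, num_experts, prefix=""):
    """Read the same keys back; report any that are absent instead of skipping them."""
    gate_up, down, missing = [], [], []
    for i in range(num_experts):
        for proj, out in (("gate_up_projs", gate_up), ("down_projs", down)):
            key = expert_key(prefix, proj, i)
            if key in state:
                out.append(state[key])
            else:
                missing.append(key)
    return gate_up, down, missing
```

A round-trip check (save, then load, then compare) in the test suite would catch a key mismatch of the kind described here before it reaches a checkpoint.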

Datta0 · 2026-02-03T08:34:37Z

Closing in favor of #450

Datta0 added 30 commits December 12, 2025 06:17

[WIP] fix for qwen3 moe torch compile issue

730c598

faster forward passes

f7967d3

Cleanup

8e7dc5c

[WIP] Use unsloth triton kernels

b38a20e

Perf go brrr sad memory

5a8adbb

clear cache after autotune

9cb5c88

Efficient

f2cda49

Fix tensor is none check

e93c67d

torch.grouped_mm

32ffa85

Adapt to qwen3-vl-moe

32bf9a4

fix qwen3-moe bugs

2fee30c

cleanup

44f4eda

cleanup

820be05

cleanup

3d7cc2c

Fix issues with triton

7dc9e17

Merge branch 'nightly' into qwen3_moe_kernels

3383613

refactor lora request handling

699945f

Merge remote-tracking branch 'origin/main' into qwen3_moe_kernels

2d30faa

Merge remote-tracking branch 'datta0/vllm_lora_req' into qwen3_moe_kernels

1ef4d0e

fixup qwen3_vl_moe training

225b285

grouped_mm for H100 or higher

bbc8072

contiguous for triton

b502a48

cleanup

ded4bf2

rework triton import logic

dcd5a14

indentation fix

ed37079

Explicit tensor handling

9382b6f

rework operations to suit newer transformers v5

5225caa

GRPO fixes

4aa9bd6

grouped_mm forward check :)

169b1ea

Revert "grouped_mm forward check :)"

ee88018

This reverts commit 169b1ea.

Datta0 added 19 commits January 15, 2026 15:10

[WIP] MXFP4 grouped_mm

776d7e7

[WIP] MXFP4 grouped_mm dequantize mxfp4

b672d3c

[WIP] cleanup

7f276fb

Some more optimisations

065331c

Merge branch 'qwen3_moe_kernels' into gpt_oss_moe_kernels

7cce0ac

Update GPT oss patches MoE LoRA

eb65a71

Cleanup

7078b24

Merge remote-tracking branch 'origin/nightly' into gpt_oss_moe_kernels

3932a89

Merge remote-tracking branch 'origin/nightly' into qwen3_moe_kernels

0d593cd

logger fix

75c653c

Imports and cache folder handling

00351ea

Merge branch 'qwen3_moe_kernels' into gpt_oss_moe_kernels

152943e

Cleanup and fix duplications

385ac8f

Fix lora layouts

2ec2fe2

Qwen3 VL fixes

c358442

Qwen3 VL fixes

bc1a645

Add back qwen3moe and qwen3vlmoe to unsloth_compiled_cache

4596e92

No double compile

e92658d

Merge branch 'qwen3_moe_kernels' into gpt_oss_moe_kernels

60350d5

Datta0 changed the title ~~[MoE] Gpt oss moe kernels~~ [WIP][MoE] Gpt oss moe kernels Jan 26, 2026

gemini-code-assist Bot reviewed Jan 26, 2026

chatgpt-codex-connector Bot reviewed Jan 26, 2026

Datta0 added 3 commits January 26, 2026 14:22

Cleanup

3df553a

Merge remote-tracking branch 'origin/nightly' into qwen3_moe_kernels

709b92a

Merge branch 'qwen3_moe_kernels' into gpt_oss_moe_kernels

3298898

Datta0 mentioned this pull request Jan 29, 2026

[MoE] Qwen3MoE, Qwen3VLMoE, GPT OSS, Glm 4.7, DeepseekV3 MoE kernels 🚀 #450

Merged

Datta0 closed this Feb 3, 2026

Datta0 mentioned this pull request Feb 6, 2026

[FIX] Qwen3 moe torch compile issue #381

Closed
