[MoE] Qwen3MoE, Qwen3VLMoE, GPT OSS, Glm 4.7, DeepseekV3 MoE kernels 🚀#450

Merged
danielhanchen merged 98 commits into unslothai:main from Datta0:glm47_moe_kernels
Feb 5, 2026

@Datta0 Datta0 commented Jan 29, 2026

Collaborator

This PR substantially improves fine-tuning performance for the MoE models listed in the title, but it relies on changes that are integral to transformers v5.

PS: To use the Triton kernels or grouped_mm as mentioned here, we also need the corresponding changes in unsloth.

Along with the speed improvements, I also observed lower memory usage: with grouped_mm, 16-bit LoRA fine-tuning at a sequence length of 8192 ran on an H100, whereas the pure PyTorch code hit OOM errors.
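
For readers unfamiliar with grouped matmuls, here is a minimal, dependency-free sketch of the token layout that grouped GEMM style kernels commonly consume: token indices are sorted by their assigned expert so that each expert owns one contiguous segment, delimited by cumulative offsets. This only illustrates the general idea; the helper name `group_tokens_by_expert` and the toy routing are assumptions made for this example, not code from this PR.

```python
from itertools import accumulate


def group_tokens_by_expert(expert_ids, num_experts):
    """Hypothetical helper: tokens routed to expert e end up at
    order[offsets[e]:offsets[e + 1]]."""
    # Stable sort of token indices by the expert each token was routed to.
    order = sorted(range(len(expert_ids)), key=lambda i: expert_ids[i])
    # Count tokens per expert, then turn the counts into cumulative offsets.
    counts = [0] * num_experts
    for e in expert_ids:
        counts[e] += 1
    offsets = [0] + list(accumulate(counts))
    return order, offsets


# Toy routing: 6 tokens, 4 experts, top-1 assignment (illustrative values).
expert_ids = [2, 0, 2, 1, 0, 2]
order, offsets = group_tokens_by_expert(expert_ids, num_experts=4)
for e in range(4):
    segment = order[offsets[e]:offsets[e + 1]]
    print(f"expert {e}: tokens {segment}")
```

An expert that receives no tokens simply gets an empty segment, so a single call over the whole sorted batch can cover every expert without a Python-level loop over experts.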

Extensive benchmarks and release blog

image

Transformers v4 unsloth latest release
image

Transformers v5 + pure pytorch
image

Transformers v5 + grouped_mm
image

Transformers v5 + unsloth triton kernels
image

Previous PRs: #396 #447

@Datta0 Datta0 changed the title from "[MoE] Glm 4.7 moe kernels" to "[MoE] Qwen3MoE, Qwen3VLMoE, GPT OSS, Glm 4.7, DeepseekV3 moe kernels 🚀" on Feb 3, 2026
@Datta0 Datta0 changed the title from "[MoE] Qwen3MoE, Qwen3VLMoE, GPT OSS, Glm 4.7, DeepseekV3 moe kernels 🚀" to "[MoE] Qwen3MoE, Qwen3VLMoE, GPT OSS, Glm 4.7, DeepseekV3 MoE kernels 🚀" on Feb 3, 2026
This was referenced Feb 3, 2026
danielhanchen and others added 18 commits February 3, 2026 11:12
Two fixes:
1. Early return when HAS_TRITON_KERNELS=False to skip MXFP4 patches gracefully
2. Move mlp_forward inside if HAS_TRITON_KERNELS block since it uses routing variable
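
The two fixes above follow a common optional-dependency guard pattern, which might look roughly like the sketch below. Everything here is an assumption for illustration: the module name `triton_kernels`, the `patch_mxfp4` entry point, and the body of `mlp_forward` are hypothetical and not taken from the actual commit.

```python
import importlib.util

# Assumption: the optional dependency is importable as "triton_kernels".
HAS_TRITON_KERNELS = importlib.util.find_spec("triton_kernels") is not None


def patch_mxfp4(has_kernels=HAS_TRITON_KERNELS):
    """Hypothetical patch entry point; returns True only if patches ran."""
    if not has_kernels:
        # Fix 1: return early so nothing below touches kernel-only names.
        return False

    # Fix 2: code that needs `routing` lives inside the guarded path, so the
    # name is only referenced when the optional import is available.
    from triton_kernels import routing  # assumed module layout

    def mlp_forward(hidden_states):
        # Placeholder body; a real forward pass would use `routing` here.
        raise NotImplementedError

    return True


print(patch_mxfp4(has_kernels=False))
```

Keeping the dependent function inside the guarded branch avoids a `NameError` at import time on machines where the optional package is missing.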
The file already has an AGPLv3 license header at the top, so inline comments are unnecessary for these trivial one-line functions:
- native_moe_grouped_mm()
- _should_use_separated_lora()
- register_weight_preprocessor()
- get_weight_preprocessor()
This reverts commit 690f25ede162777ace69f08dbf7fe83bbc3a4db5.
This reverts commit e9dddc3597b2dd333b10278951591c07aa811fa5.
Removed random AI mat muls and lora extractions that slowed down entire MoE forward pass
4 participants