[NVIDIA] Support TF32 matmul to improve MiniMax gate gemm performance #22744
Open
trevor-m wants to merge 1 commit into sgl-project:main
Conversation
Contributor
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
Collaborator
@trevor-m Instead of adding a new argument that globally enables this feature, can we enable the tf32 matmul flag only for the MiniMax model, since it has only been verified on this model?
Collaborator
Author
Hi @Fridge003, this feature can benefit any model with FP32 gemms, so I think it's nice to have it as an optional flag; it's currently off by default. Users can enable it once they've verified that accuracy is acceptable. For MiniMax, do you want it enabled by default?
thanhhao98 pushed a commit to thanhhao98/sglang that referenced this pull request on Apr 25, 2026
Stacks on top of the previous commit (default routed-MoE). Calls torch.set_float32_matmul_precision("high") in the same Glm4MoeForCausalLM sm100 auto-default block, so any GLM-4.7-NVFP4 launch on Blackwell gets the TF32 tensor-core path for the FP32 router gemm (5120 -> 160).

This is a port of the pending sgl-project#22744 ("[NVIDIA] Support TF32 matmul to improve MiniMax gate gemm performance"). On MiniMax-M2.5: +7% output throughput and -8% latency at batch=64; the FP32 router gemm drops from 9.1% to 3.3% of decode time; GPQA accuracy is preserved. GLM-4.7 has the same router topology (5120 -> N_experts FP32 cast), so the same gain should apply. Bench data to follow as optimal_v2.

The FP32 cast from PR sgl-project#21660 still happens upstream of the matmul; this changes only the matmul kernel to use TF32 tensor cores. Gate with the existing GSM8K accuracy CI before merging.
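For reference, a minimal sketch of the kind of guard this commit describes. The helper name and the capability check are illustrative assumptions, not the actual sglang code; only the torch.set_float32_matmul_precision call is confirmed by the commit message:

```python
import torch

def maybe_enable_tf32_matmul() -> None:
    # Hypothetical helper: enable TF32 for FP32 matmuls only on GPUs
    # that have TF32 tensor cores (Ampere sm80 and newer; the commit
    # above targets Blackwell sm100 specifically).
    if torch.cuda.is_available():
        major, _minor = torch.cuda.get_device_capability()
        if major >= 8:
            # "high" lets FP32 matmuls compute internally in TF32
            # while keeping FP32 input/output dtypes.
            torch.set_float32_matmul_precision("high")
```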
thanhhao98 pushed a commit to thanhhao98/sglang that referenced this pull request on Apr 25, 2026
The previous attempt (commit d6a435b) called torch.set_float32_matmul_precision("high") in ServerArgs.__post_init__, which runs in the parent process only. SGLang's worker processes are spawned (not forked), so PyTorch state set in the parent does not propagate. Bench data confirmed v2 was a no-op: v1 vs v2 nvfp4_tp8 throughput at all 10 cc points, |delta| < 1%. A microbench inside the v2 image confirmed TF32 is functional (3.18x speedup on the gate matmul shape), but only after setting it explicitly in the same process. Hence each TP rank's worker must call set_float32_matmul_precision itself.

This commit moves the call into Glm4MoeGate.__init__, gated by a class-level _tf32_set flag so it fires exactly once per worker. The parent-side call from d6a435b is left in place for redundancy (harmless, idempotent). Glm4MoeGate is the only class that performs an FP32 matmul in the GLM-4.7 forward graph (see lines 372-375, the FP32-cast gate projection per PR sgl-project#21660), so the TF32 setting has zero scope beyond what's needed; no other FP32 matmuls get accelerated.

Expected gain on Blackwell sm100: ~5-7% throughput at high cc per PR sgl-project#22744's MiniMax-M2.5 measurement, mediated by what fraction of decode time the gate gemm accounts for. To be measured as optimal_v3.
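A hedged sketch of the once-per-worker gating pattern described above. Glm4MoeGate here is a simplified stand-in for the real sglang class, with illustrative default shapes taken from the router dimensions quoted earlier:

```python
import torch
import torch.nn.functional as F
from torch import nn

class Glm4MoeGate(nn.Module):
    _tf32_set = False  # class-level flag: the call fires once per process

    def __init__(self, hidden_size: int = 5120, n_experts: int = 160):
        super().__init__()
        if not Glm4MoeGate._tf32_set:
            # Must run inside the spawned worker: global PyTorch state
            # set in the parent process does not propagate to it.
            torch.set_float32_matmul_precision("high")
            Glm4MoeGate._tf32_set = True
        self.weight = nn.Parameter(torch.randn(n_experts, hidden_size))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # FP32-cast gate projection (per sgl-project#21660); with matmul
        # precision "high", this matmul runs on TF32 tensor cores while
        # keeping FP32 input/output dtypes.
        return F.linear(hidden_states.float(), self.weight.float())
```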
Motivation

Before this change, the FP32 gate gemm takes 9.1% of e2e decode time for MiniMax-M2.5 at bs 64. With --enable-tf32-matmul, it is reduced to 3.3%.

Modifications

Use torch.set_float32_matmul_precision('high') to use TF32 as the internal computation type for FP32 matmuls when available. This improves performance with minimal impact on accuracy. See the torch docs: https://docs.pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html
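As an illustration, a small before/after timing sketch on an FP32 matmul. The shape (64 x 5120 by 5120 x 160) is borrowed from the router dimensions discussed in this thread, not from the PR's benchmark script; timings will vary by GPU and are not the PR's numbers:

```python
import torch

x = torch.randn(64, 5120, device="cuda", dtype=torch.float32)
w = torch.randn(5120, 160, device="cuda", dtype=torch.float32)

def time_matmul() -> float:
    # Returns average per-iteration time in milliseconds.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(10):  # warm-up
        x @ w
    start.record()
    for _ in range(100):
        x @ w
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / 100

torch.set_float32_matmul_precision("highest")  # strict FP32 (default)
fp32_ms = time_matmul()
torch.set_float32_matmul_precision("high")     # allow TF32 internally
tf32_ms = time_matmul()
print(f"fp32: {fp32_ms:.4f} ms, tf32: {tf32_ms:.4f} ms")
```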
Accuracy Tests
GPQA
Speed Tests and Profiling
Before
After
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci