Non-record: MUD optimizer — triangular Gram preconditioning (arxiv:2603.17970)#510
Open
SelfAnush wants to merge 4 commits into openai:main from
Conversation
… slow TRSM on H100

Non-record: MUD optimizer (arxiv:2603.17970)

Replaces Muon's 5-step Newton-Schulz with MUD's triangular Gram preconditioning. Single seed (42) on 8xH100 SXM.

Results:
- val_bpb: 1.1989 (sliding window eval, stride=64)
- Steps: 5,087 in 10 min
- step_avg: 118ms (4.5x slower than Muon's ~26ms on H100)

Key finding: Strong convergence (within 0.056 BPB of SOTA with 4x fewer steps) but TRSM overhead on H100 CUDA negates the 12x FLOP savings reported in the paper (tested on A100/MI250/GH200).

Built on SOTA by @thwu1 (PR openai#180).

Paper: https://arxiv.org/abs/2603.17970
Summary
Replaces Muon's 5-step Newton-Schulz iteration with MUD's triangular Gram
preconditioning (Algorithm 2 from arxiv:2603.17970, Southworth & Thomas, Mar 2026).
Single seed (42) on 8xH100 SXM. Marked as non-record due to a throughput
issue on H100s.
Results
[Figure: convergence curve vs. Muon SOTA (PR #180)]
Key Finding
MUD achieves strong convergence (1.1989 BPB in only 5,087 steps) but is
4.5x slower per step than Muon on H100s. The paper's throughput claims
(1.3-2.6x over Muon) were measured on A100/MI250/GH200;
torch.linalg.solve_triangular on H100 CUDA is not as well optimized as GEMM on the Hopper architecture.
If MUD could match Muon's step time, extrapolating the convergence curve
suggests it could reach ~1.10 BPB in 20,000 steps.
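The TRSM-vs-GEMM gap above can be illustrated with a minimal CPU-side timing sketch. This is not the PR's benchmark (that ran torch.linalg.solve_triangular on H100 CUDA); it is a hypothetical NumPy/SciPy comparison of the triangular-solve route against an explicit-inverse GEMM route on the same shapes, with all sizes and the time_op helper being assumptions:

```python
import time

import numpy as np
from scipy.linalg import solve_triangular

def time_op(fn, reps=10):
    """Return the best wall-clock time over `reps` runs (hypothetical helper)."""
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best

rng = np.random.default_rng(0)
n, m = 512, 2048                     # assumed shapes, not from the PR
M = rng.standard_normal((n, m))
# Well-conditioned lower-triangular factor of a (ridged) Gram matrix.
L = np.linalg.cholesky(M @ M.T + n * np.eye(n))

# Route 1: triangular solve (TRSM), the op reported as slow on H100.
trsm = time_op(lambda: solve_triangular(L, M, lower=True))
# Route 2: explicit inverse followed by a GEMM.
gemm = time_op(lambda: np.linalg.inv(L) @ M)

print(f"TRSM: {trsm * 1e3:.1f} ms, inv+GEMM: {gemm * 1e3:.1f} ms")
```

The relative numbers depend entirely on hardware and BLAS build; the point is only that the two routes compute the same quantity through differently optimized kernels.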
What changed
Only the optimizer:
mud_whiten() replaces zeropower_via_newtonschulz5(). Everything else
(architecture, quantization, training loop) is identical to SOTA by
@thwu1 (PR #180).
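The shape of that swap can be sketched as follows. This is a hypothetical NumPy reconstruction of what a triangular-Gram-preconditioning whitener along these lines might look like, not the PR's actual mud_whiten() (which runs in PyTorch on GPU); the ridge term eps and the row-Gram choice are assumptions, not taken from the paper:

```python
import numpy as np
from scipy.linalg import solve_triangular

def mud_whiten(G, eps=1e-7):
    """Hypothetical sketch: whiten a gradient matrix G via a triangular
    solve against the Cholesky factor of its Gram matrix, in place of
    Muon's 5-step Newton-Schulz iteration.

    If A = G @ G.T = L @ L.T, then X = L^{-1} @ G satisfies
    X @ X.T = I, i.e. X has orthonormal rows, matching the target of
    Newton-Schulz orthogonalization. The L^{-1} @ G product is computed
    as a triangular solve (TRSM), the op reported as slow on H100 above.
    """
    n = G.shape[0]
    A = G @ G.T + eps * np.eye(n)  # ridge for numerical safety (assumption)
    L = np.linalg.cholesky(A)      # lower-triangular Gram factor
    return solve_triangular(L, G, lower=True)  # TRSM: solves L @ X = G

rng = np.random.default_rng(42)
G = rng.standard_normal((64, 256))
X = mud_whiten(G)
# Rows of X are orthonormal up to the ridge perturbation.
```

One triangular solve replaces five Newton-Schulz matmul rounds, which is where the paper's FLOP savings come from; the cost model only pays off where TRSM kernels are competitive with GEMM.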
References
Southworth & Thomas (Mar 2026), arxiv:2603.17970: https://arxiv.org/abs/2603.17970
Baseline: Muon SOTA by @thwu1 (PR #180)