
Non-record: MUD optimizer — triangular Gram preconditioning (arxiv:2603.17970) #510

Open

SelfAnush wants to merge 4 commits into openai:main from SelfAnush:mud-optimizer-submission

Conversation

@SelfAnush

Summary

Replaces Muon's 5-step Newton-Schulz iteration with MUD's triangular Gram
preconditioning (Algorithm 2 from arxiv:2603.17970, Southworth & Thomas, Mar 2026).
Single seed (42) on 8xH100 SXM. Marked as a non-record due to a throughput
issue on H100s.

Results

| Metric | Value |
| --- | --- |
| Final val_bpb | 1.1989 |
| Final val_loss | 2.0243 |
| Steps in 10 min | 5,087 |
| step_avg | 118 ms |
| Peak memory | 18,866 MiB |

Convergence Curve

| Step | val_bpb |
| --- | --- |
| 500 | 1.4604 |
| 1000 | 1.3649 |
| 2000 | 1.3191 |
| 3000 | 1.2647 |
| 4000 | 1.2291 |
| 5000 | 1.1945 |
| Final (post-quant) | 1.1989 |
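As a rough sanity check on extrapolations from this curve, the points above can be fed to a crude log-linear fit. This is a hypothetical back-of-the-envelope calculation, not part of the run; the real curve likely flattens, so treat the result as an optimistic bound.

```python
import math

# Convergence points from the table above: (step, val_bpb).
points = [(500, 1.4604), (1000, 1.3649), (2000, 1.3191),
          (3000, 1.2647), (4000, 1.2291), (5000, 1.1945)]

# Least-squares fit of val_bpb ≈ a + b * ln(step).
xs = [math.log(s) for s, _ in points]
ys = [b for _, b in points]
n = len(points)
xbar, ybar = sum(xs) / n, sum(ys) / n
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
intercept = ybar - slope * xbar

pred_20k = intercept + slope * math.log(20_000)
print(f"log-linear fit predicts val_bpb ≈ {pred_20k:.3f} at 20,000 steps")
```

The fit lands a bit below the ~1.10 estimate quoted later in this PR, consistent with the ~1.10 figure already baking in some diminishing returns.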

vs. Muon SOTA (PR #180)

| Metric | Muon | MUD (this PR) |
| --- | --- | --- |
| step_avg | ~26 ms | 118 ms |
| Steps in 10 min | ~20,000 | 5,087 |
| Final val_bpb | 1.1428 | 1.1989 |

Key Finding

MUD converges strongly (1.1989 val_bpb in only 5,087 steps) but runs ~4.5x
slower per step than Muon on H100s. The paper's throughput claims (1.3-2.6x
over Muon) were measured on A100/MI250/GH200; on H100 CUDA,
torch.linalg.solve_triangular is far less optimized than Hopper's GEMM path.
If MUD could match Muon's step time, extrapolating the convergence curve
suggests it could reach ~1.10 val_bpb in 20,000 steps.
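The TRSM-vs-GEMM gap can be probed with a minimal micro-benchmark sketch. This is a hypothetical harness, not the PR's profiling code: `n = 512` is a placeholder shape, and reproducing the H100 numbers requires running on that hardware (ideally with `torch.cuda.Event` timing instead of wall clock).

```python
import time
import torch

def bench(fn, iters: int = 20) -> float:
    """Median wall-clock seconds per call (crude; prefer CUDA events on GPU)."""
    fn()  # warmup
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        times.append(time.perf_counter() - t0)
    return sorted(times)[len(times) // 2]

n = 512  # placeholder; actual shapes depend on the model's weight matrices
A = torch.randn(n, n)
L = torch.linalg.cholesky(A @ A.T + n * torch.eye(n))  # triangular factor
B = torch.randn(n, n)

gemm_t = bench(lambda: L @ B)                                      # plain matmul
trsm_t = bench(lambda: torch.linalg.solve_triangular(L, B, upper=False))  # TRSM
print(f"GEMM {gemm_t * 1e3:.2f} ms vs TRSM {trsm_t * 1e3:.2f} ms")
```

The relative gap is hardware- and kernel-version-dependent, which is exactly the point of the key finding: FLOP counts alone do not predict step time.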

What changed

Only the optimizer: mud_whiten() replaces zeropower_via_newtonschulz5().
Everything else (architecture, quantization, training loop) is identical to
the SOTA setup by @thwu1 (PR #180).
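For readers without the diff handy, here is a minimal sketch of what Cholesky-based triangular Gram preconditioning looks like. This is an illustrative stand-in, not the PR's actual `mud_whiten()` and not the paper's exact Algorithm 2; the name, epsilon jitter, and short-side transpose are assumptions.

```python
import torch

def mud_whiten_sketch(grad: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Orthogonalize `grad` via a triangular factor of its Gram matrix.

    If G = A A^T = L L^T (Cholesky), then X = L^{-1} A satisfies X X^T = I,
    so one Cholesky plus one TRSM replaces Muon's 5-step Newton-Schulz
    iteration. Hypothetical sketch, not the PR's implementation.
    """
    wide = grad.shape[0] <= grad.shape[1]
    A = grad if wide else grad.T            # work on the short side
    G = A @ A.T                             # Gram matrix
    G.diagonal().add_(eps * G.diagonal().mean())  # jitter for stability
    L = torch.linalg.cholesky(G)
    X = torch.linalg.solve_triangular(L, A, upper=False)  # the TRSM step
    return X if wide else X.T
```

The TRSM here is the operation identified above as the H100 bottleneck, even though it does far fewer FLOPs than five Newton-Schulz matmuls.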

References

… slow TRSM on H100

Non-record: MUD optimizer (arxiv:2603.17970)
Replaces Muon's 5-step Newton-Schulz with MUD's triangular Gram
preconditioning. Single seed (42) on 8xH100 SXM.
Results:
- val_bpb: 1.1989 (sliding window eval, stride=64)
- Steps: 5,087 in 10 min
- step_avg: 118ms (4.5x slower than Muon's ~26ms on H100)
Key finding: Strong convergence (within 0.056 BPB of SOTA with 4x
fewer steps) but TRSM overhead on H100 CUDA negates the 12x FLOP
savings reported in the paper (tested on A100/MI250/GH200).
Built on SOTA by @thwu1 (PR openai#180).
Paper: https://arxiv.org/abs/2603.17970