
Non-record: MUD optimizer — triangular Gram preconditioning (arxiv:2603.17970) #510

Open

SelfAnush wants to merge 4 commits into openai:main from SelfAnush:mud-optimizer-submission

Conversation

@SelfAnush

Summary

Replaces Muon's 5-step Newton-Schulz iteration with MUD's triangular Gram
preconditioning (Algorithm 2 from arxiv:2603.17970, Southworth & Thomas, Mar 2026).
Single seed (42) on 8xH100 SXM. Marked as a non-record due to a throughput
issue on H100s.

Results

| Metric | Value |
| --- | --- |
| Final val_bpb | 1.1989 |
| Final val_loss | 2.0243 |
| Steps in 10 min | 5,087 |
| step_avg | 118 ms |
| Peak memory | 18,866 MiB |

Convergence Curve

| Step | val_bpb |
| --- | --- |
| 500 | 1.4604 |
| 1000 | 1.3649 |
| 2000 | 1.3191 |
| 3000 | 1.2647 |
| 4000 | 1.2291 |
| 5000 | 1.1945 |
| Final (post-quant) | 1.1989 |
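As a rough sanity check on extrapolations from this curve, the points above can be fed to a crude log-linear fit. This is a hypothetical back-of-the-envelope calculation, not part of the run; the real curve likely flattens, so treat the result as an optimistic bound.

```python
import math

# Convergence points from the table above: (step, val_bpb).
points = [(500, 1.4604), (1000, 1.3649), (2000, 1.3191),
          (3000, 1.2647), (4000, 1.2291), (5000, 1.1945)]

# Least-squares fit of val_bpb ≈ a + b * ln(step).
xs = [math.log(s) for s, _ in points]
ys = [b for _, b in points]
n = len(points)
xbar, ybar = sum(xs) / n, sum(ys) / n
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
intercept = ybar - slope * xbar

pred_20k = intercept + slope * math.log(20_000)
print(f"log-linear fit predicts val_bpb ≈ {pred_20k:.3f} at 20,000 steps")
```

The fit lands a bit below the ~1.10 estimate quoted later in this PR, consistent with the ~1.10 figure already baking in some diminishing returns.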

vs. Muon SOTA (PR #180)

| Metric | Muon | MUD (this PR) |
| --- | --- | --- |
| step_avg | ~26 ms | 118 ms |
| Steps in 10 min | ~20,000 | 5,087 |
| Final val_bpb | 1.1428 | 1.1989 |

Key Finding

MUD converges strongly (1.1989 val_bpb in only 5,087 steps) but runs ~4.5x
slower per step than Muon on H100s. The paper's throughput claims (1.3-2.6x
over Muon) were measured on A100/MI250/GH200; on H100 CUDA,
torch.linalg.solve_triangular is far less optimized than Hopper's GEMM path.
If MUD could match Muon's step time, extrapolating the convergence curve
suggests it could reach ~1.10 val_bpb in 20,000 steps.
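The TRSM-vs-GEMM gap can be probed with a minimal micro-benchmark sketch. This is a hypothetical harness, not the PR's profiling code: `n = 512` is a placeholder shape, and reproducing the H100 numbers requires running on that hardware (ideally with `torch.cuda.Event` timing instead of wall clock).

```python
import time
import torch

def bench(fn, iters: int = 20) -> float:
    """Median wall-clock seconds per call (crude; prefer CUDA events on GPU)."""
    fn()  # warmup
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        times.append(time.perf_counter() - t0)
    return sorted(times)[len(times) // 2]

n = 512  # placeholder; actual shapes depend on the model's weight matrices
A = torch.randn(n, n)
L = torch.linalg.cholesky(A @ A.T + n * torch.eye(n))  # triangular factor
B = torch.randn(n, n)

gemm_t = bench(lambda: L @ B)                                      # plain matmul
trsm_t = bench(lambda: torch.linalg.solve_triangular(L, B, upper=False))  # TRSM
print(f"GEMM {gemm_t * 1e3:.2f} ms vs TRSM {trsm_t * 1e3:.2f} ms")
```

The relative gap is hardware- and kernel-version-dependent, which is exactly the point of the key finding: FLOP counts alone do not predict step time.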

What changed

Only the optimizer: mud_whiten() replaces zeropower_via_newtonschulz5().
Everything else (architecture, quantization, training loop) is identical to
the SOTA setup by @thwu1 (PR #180).
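For readers without the diff handy, here is a minimal sketch of what Cholesky-based triangular Gram preconditioning looks like. This is an illustrative stand-in, not the PR's actual `mud_whiten()` and not the paper's exact Algorithm 2; the name, epsilon jitter, and short-side transpose are assumptions.

```python
import torch

def mud_whiten_sketch(grad: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Orthogonalize `grad` via a triangular factor of its Gram matrix.

    If G = A A^T = L L^T (Cholesky), then X = L^{-1} A satisfies X X^T = I,
    so one Cholesky plus one TRSM replaces Muon's 5-step Newton-Schulz
    iteration. Hypothetical sketch, not the PR's implementation.
    """
    wide = grad.shape[0] <= grad.shape[1]
    A = grad if wide else grad.T            # work on the short side
    G = A @ A.T                             # Gram matrix
    G.diagonal().add_(eps * G.diagonal().mean())  # jitter for stability
    L = torch.linalg.cholesky(G)
    X = torch.linalg.solve_triangular(L, A, upper=False)  # the TRSM step
    return X if wide else X.T
```

The TRSM here is the operation identified above as the H100 bottleneck, even though it does far fewer FLOPs than five Newton-Schulz matmuls.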

References

… slow TRSM on H100

Non-record: MUD optimizer (arxiv:2603.17970)
Replaces Muon's 5-step Newton-Schulz with MUD's triangular Gram
preconditioning. Single seed (42) on 8xH100 SXM.
Results:
- val_bpb: 1.1989 (sliding window eval, stride=64)
- Steps: 5,087 in 10 min
- step_avg: 118ms (4.5x slower than Muon's ~26ms on H100)
Key finding: Strong convergence (within 0.056 BPB of SOTA with 4x
fewer steps) but TRSM overhead on H100 CUDA negates the 12x FLOP
savings reported in the paper (tested on A100/MI250/GH200).
Built on SOTA by @thwu1 (PR openai#180).
Paper: https://arxiv.org/abs/2603.17970