[1/6] Curious about Muon, but not sure where to start? I wrote a 3-part blog series called “Understanding Muon” designed to get you up to speed—with The Matrix references, annotated source code, and thoughts on where Muon might be going.
[1/9] We created a performant Lipschitz transformer by spectrally regulating the weights—without using activation stability tricks: no layer norm, QK norm, or logit softcapping. We think this may address a “root cause” of unstable training.
[2/9] Muon spectrally regulates gradients, but what if we also spectrally regulate weights? Then activations stay small—nearly fp8 range. Activation entries in our GPT-2 scale transformers don’t exceed ~100 vs. baseline ~1000.
Check out our paper:
[4/9] One of the things we’re most excited about is efficient primitives inspired by Muon and related to Kimi AI’s recent work. We introduce a family of methods that cap singular values by applying min(1, x) to them, co-designed for Muon’s high-stable-rank updates.
[9/9] We’re really excited to see where the community can take this.
We’re publishing all our code and data—there’s lots more to test and understand that can help us achieve adversarial robustness, bounded activations, and stable training at scale.
Read the paper:
[3/9] Our main goal was to enforce a provable Lipschitz bound on NanoGPT while matching unconstrained val loss. But more work is needed! Our current methods bound the Lipschitz constant at 10^264.
[5/9] A Lipschitz bound controls how sensitive a network is to input or weight changes.
With a low bound, a small change in the input can’t wildly change the output. Thus:
Lower Lipschitz bound => more robust and more predictable model
[6/9] To control a Lipschitz bound we need to control weight norms. There are many ways to do this, including weight decay, and we compare their “Lipschitzness to performance” tradeoff. Finding: Muon + capping singular values pushes the tradeoff frontier.
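To make the link between weight norms and the Lipschitz bound concrete, here is a minimal sketch. It assumes a plain ReLU MLP rather than the transformer from the paper, and the helper names (`spectral_norm`, `mlp`, `lipschitz_upper_bound`) are made up for illustration. For such a network, the product of the layers' largest singular values is an upper bound on the L2 Lipschitz constant, so shrinking any weight norm shrinks the bound.

```python
import numpy as np

def spectral_norm(W: np.ndarray) -> float:
    """Largest singular value of W, i.e. the L2 Lipschitz constant of x -> W @ x."""
    return float(np.linalg.norm(W, ord=2))

def mlp(x: np.ndarray, weights: list) -> np.ndarray:
    """Toy MLP (illustrative assumption): ReLU between layers, none after the last."""
    for W in weights[:-1]:
        x = np.maximum(W @ x, 0.0)  # ReLU is 1-Lipschitz
    return weights[-1] @ x

def lipschitz_upper_bound(weights: list) -> float:
    """Product of per-layer spectral norms: an upper bound, usually not tight."""
    return float(np.prod([spectral_norm(W) for W in weights]))

rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 8)) / np.sqrt(8) for _ in range(3)]
bound = lipschitz_upper_bound(weights)

# Compare the bound with the sensitivity observed for one small input perturbation.
x = rng.normal(size=8)
dx = 1e-3 * rng.normal(size=8)
ratio = np.linalg.norm(mlp(x + dx, weights) - mlp(x, weights)) / np.linalg.norm(dx)
print(f"upper bound: {bound:.3f}, observed sensitivity: {ratio:.3f}")
```

The observed ratio can never exceed the bound, and halving every weight matrix in this three-layer toy divides the bound by 8. That compounding across layers is why controlling weight norms matters.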
[7/9] We’ve seen exciting related work come out recently, including @Jianlin_S’s QK-clip algorithm and @_arohan_’s weight constraint thread, so we bet we missed important citations. We’d love to hear about any related work we should include!
Doing some math to cleanse the timelinez
Why does the loss blow up? A question to deepthink.
So an attempt: why not clip the singular values of the update?
σ > 1, clip to 1
σ <=1, return σ
Naive implementation:
Update = U S V.T
Update_clipped = U clip(S, 1) V.T
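The naive version above can be sketched in NumPy as follows. The function name `clip_singular_values` and the `max_sv` parameter are hypothetical, chosen for this illustration; only the SVD-then-clip recipe comes from the thread.

```python
import numpy as np

def clip_singular_values(update: np.ndarray, max_sv: float = 1.0) -> np.ndarray:
    """Cap every singular value of `update` at `max_sv` using a full SVD."""
    # Reduced SVD: update = U @ diag(S) @ Vt
    U, S, Vt = np.linalg.svd(update, full_matrices=False)
    # sigma > max_sv is clipped to max_sv; sigma <= max_sv is returned unchanged
    S_clipped = np.minimum(S, max_sv)
    # Scale the columns of U by the clipped singular values, then recombine
    return (U * S_clipped) @ Vt

rng = np.random.default_rng(0)
update = rng.normal(size=(6, 4))
clipped = clip_singular_values(update)

print("before:", np.linalg.svd(update, compute_uv=False))
print("after: ", np.linalg.svd(clipped, compute_uv=False))
```

This is only the naive route: a full SVD on every step is expensive for large weight matrices.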
How to make it
[2/6] There’s been a lot of interest in Muon recently, so I wanted to make a practitioner’s guide that's accessible to everyone in the machine learning community.
[5/6] Chapter 3 is called “Weight Regulation.” The goal is to orient people toward some exciting recent work on Muon, including @Jianlin_S's MuonClip and our recent paper.
[4/6] Chapter 2 is called “Source Code.” This was the original motivation for the whole series: I saw people getting stuck reading through Muon’s code. So I made line-by-line annotations you can hover over to read. No more being confused.
[3/6] Chapter 1 is called “Into the Matrix.” Get ready for some fun Neo references while seeing why Muon looks at the gradient as a matrix, not a vector.