Laker Newhouse (@LakerNewhouse) / X

Laker Newhouse

@LakerNewhouse

MIT '25 researching Muon & ML

Palo Alto, CA

Joined January 2023

Pinned
Laker Newhouse
@LakerNewhouse
Jul 21, 2025
[1/6] Curious about Muon, but not sure where to start? I wrote a 3-part blog series called “Understanding Muon” designed to get you up to speed—with The Matrix references, annotated source code, and thoughts on where Muon might be going.
36K
Laker Newhouse
@LakerNewhouse
Jul 19, 2025
[1/9] We created a performant Lipschitz transformer by spectrally regulating the weights—without using activation stability tricks: no layer norm, QK norm, or logit softcapping. We think this may address a “root cause” of unstable training.
147K
Laker Newhouse
@LakerNewhouse
Jul 19, 2025
Replying to @LakerNewhouse
[2/9] Muon spectrally regulates gradients, but what if we also spectrally regulate weights? Then activations stay small—nearly fp8 range. Activation entries in our GPT-2 scale transformers don’t exceed ~100 vs. baseline ~1000. Check out our paper:
arxiv.org
Training Transformers with Enforced Lipschitz Constants
Neural networks are often highly sensitive to input and weight perturbations. This sensitivity has been linked to pathologies such as vulnerability to adversarial examples, divergent training, and...
6.2K
Laker Newhouse
@LakerNewhouse
Jul 19, 2025
Replying to @LakerNewhouse
[4/9] One of the things we’re most excited about is efficient primitives inspired by Muon and related to Kimi AI’s recent work. We introduce a family of methods to cap singular values via applying min(1, x), co-designed for Muon’s high stable rank update.
4K
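To make "cap singular values via applying min(1, x)" concrete, here is a minimal reference sketch. It uses an exact SVD, which is an assumption made for clarity and not the efficient primitives the post refers to. The function name `cap_singular_values` and the `cap` parameter are hypothetical.

```python
import numpy as np


def cap_singular_values(W: np.ndarray, cap: float = 1.0) -> np.ndarray:
    """Return W with every singular value sigma replaced by min(cap, sigma)."""
    # Reduced SVD: W = U @ diag(S) @ Vt
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    # Scale each column of U by the capped singular value, then recombine.
    return (U * np.minimum(S, cap)) @ Vt


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
W_capped = cap_singular_values(W)

print("before:", np.linalg.svd(W, compute_uv=False))
print("after: ", np.linalg.svd(W_capped, compute_uv=False))
```

Singular values already at or below the cap pass through unchanged, so a matrix whose spectral norm is at most 1 is left as it was.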
Laker Newhouse
@LakerNewhouse
Jul 19, 2025
Replying to @LakerNewhouse
[9/9] We’re really excited to see where the community can take this. We’re publishing all our code and data, and there’s lots more to test and understand that can help us achieve adversarial robustness, bounded activations, and stable training at scale. Read the paper:
arxiv.org
Training Transformers with Enforced Lipschitz Constants
3.3K
Laker Newhouse
@LakerNewhouse
Jul 19, 2025
Replying to @LakerNewhouse
[3/9] Our main goal was to enforce a provable Lipschitz bound on NanoGPT while matching unconstrained val loss. But more work is needed! Our current methods bound the Lipschitz constant at 10^264.
4.6K
Laker Newhouse
@LakerNewhouse
Jul 19, 2025
Replying to @LakerNewhouse
[5/9] A Lipschitz bound controls how sensitive a network is to input or weight changes. With a low bound, a small change in the input can’t wildly change the output. Thus: Lower Lipschitz bound => more robust and more predictable model
3.2K
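For reference, the standard definition behind this post can be written out as follows. The choice of norm is an assumption here (the Euclidean norm is used for the linear-layer case), and the symbols f, L, and W are illustrative rather than taken from the paper.

```latex
% f is L-Lipschitz if output changes are at most L times input changes:
\| f(x) - f(y) \| \le L \, \| x - y \| \quad \text{for all inputs } x, y.

% For a linear layer f(x) = W x under the Euclidean norm, the smallest
% such L is the largest singular value (spectral norm) of W:
\| W x - W y \|_2 \le \sigma_{\max}(W) \, \| x - y \|_2 .
```

This is why controlling singular values of the weights is a natural handle on the bound.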
Laker Newhouse
@LakerNewhouse
Jul 19, 2025
Replying to @LakerNewhouse
[6/9] To control a Lipschitz bound we need to control weight norms. There are many ways to do this, including weight decay, and we compare their “Lipschitzness to performance” tradeoff. Finding: Muon + capping singular values pushes the tradeoff frontier.
2.8K
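A small sketch of why weight norms control the bound, under simplifying assumptions: a toy ReLU MLP (not a transformer), the Euclidean norm, and hypothetical helper names `spectral_norm`, `mlp_lipschitz_bound`, and `mlp`. Because ReLU is 1-Lipschitz, the product of the layers' spectral norms upper-bounds the network's Lipschitz constant. Attention and residual connections need more care and are not covered here.

```python
import numpy as np


def spectral_norm(W: np.ndarray) -> float:
    """Largest singular value of W."""
    return float(np.linalg.norm(W, ord=2))


def mlp_lipschitz_bound(weights: list) -> float:
    """Product of spectral norms: an upper bound for a ReLU MLP."""
    bound = 1.0
    for W in weights:
        bound *= spectral_norm(W)
    return bound


def mlp(x: np.ndarray, weights: list) -> np.ndarray:
    """Toy MLP: ReLU after every layer except the last."""
    for W in weights[:-1]:
        x = np.maximum(W @ x, 0.0)
    return weights[-1] @ x


rng = np.random.default_rng(0)
weights = [
    rng.normal(size=(16, 8)),
    rng.normal(size=(16, 16)),
    rng.normal(size=(4, 16)),
]
bound = mlp_lipschitz_bound(weights)

# Empirical sensitivity for one random pair of inputs.
x, y = rng.normal(size=8), rng.normal(size=8)
ratio = np.linalg.norm(mlp(x, weights) - mlp(y, weights)) / np.linalg.norm(x - y)

print("observed ratio:", ratio)
print("product bound: ", bound)
```

The bound multiplies across layers, so it grows quickly with depth unless each layer's spectral norm is kept near or below 1.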
Laker Newhouse
@LakerNewhouse
Jul 19, 2025
Replying to @LakerNewhouse
[7/9] We’ve seen exciting related work come out recently including @Jianlin_S’s QK-clip algorithm and @_arohan_’s weight constraint thread, so we bet we missed important citations—we’d love to hear any related work we should include!
rohan anil
@_arohan_
Jun 3, 2025
Doing some math to cleanse the timelinez
Why do loss blow up? A question to deepthink.
So an attempt: why not clip the singular values of the update?
σ > 1, clip to 1
σ <= 1, return σ
Naive implementation:
Update = U S V.T
Update_clipped = U clip(S, 1) V.T
How to make it
3.7K
Laker Newhouse
@LakerNewhouse
Jul 21, 2025
Replying to @LakerNewhouse
[2/6] There’s been a lot of interest in Muon recently, so I wanted to make a practitioner’s guide that's accessible to everyone in the machine learning community.
Understanding Muon
From lakernewhouse.com
2.4K
Laker Newhouse
@LakerNewhouse
Jul 19, 2025
Replying to @LakerNewhouse
[8/9] This is work done alongside fantastic friends and collaborators: @phess002 @leloykun @anzahorodnii @jxbz @phillip_isola
3.6K
Laker Newhouse
@LakerNewhouse
Jul 21, 2025
Replying to @LakerNewhouse
[5/6] Chapter 3 is called “Weight Regulation.” The goal is to orient people toward some exciting recent work on Muon, including @Jianlin_S's MuonClip and our recent paper.
2.5K
Laker Newhouse
@LakerNewhouse
Jul 21, 2025
Replying to @LakerNewhouse
[4/6] Chapter 2 is called “Source Code.” This was the original motivation for the whole series: I saw people getting stuck reading through Muon’s code. So I made line-by-line annotations you can hover over to read. No more being confused.
2.4K
Laker Newhouse
@LakerNewhouse
Jul 21, 2025
Replying to @LakerNewhouse
[3/6] Chapter 1 is called “Into the Matrix.” Get ready for some fun Neo references while seeing why Muon looks at the gradient as a matrix, not a vector.
2.1K
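One way to see what treating the gradient as a matrix buys you: a matrix has singular values, and an update can be built that sets all of them to 1. The sketch below does this with an exact SVD, which is an assumption made for clarity. To my understanding, Muon approximates this orthogonalization with a Newton-Schulz iteration rather than computing an SVD. The function name `orthogonalize` is hypothetical.

```python
import numpy as np


def orthogonalize(G: np.ndarray) -> np.ndarray:
    """Replace every singular value of G with 1, i.e. return U @ Vt."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt


rng = np.random.default_rng(0)
G = rng.normal(size=(6, 4))  # stand-in for a weight matrix's gradient
O = orthogonalize(G)

print("gradient singular values:", np.linalg.svd(G, compute_uv=False))
print("update singular values:  ", np.linalg.svd(O, compute_uv=False))
```

Flattening `G` into a vector would discard this structure, since a vector has only a length and no spectrum to equalize.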