Lucas Nestler (@Clashluke) / X

Lucas Nestler

2,101 posts

@Clashluke

Researcher

Zurich, Switzerland

convergentthinking.sh

Joined October 2020

Lucas Nestler
@Clashluke
Jul 3, 2024
Schedule-free optimizers (x.com/aaron_defazio/…) are surreal. I've read the paper, looked into the math, and tried to understand what's happening. It all seems like an incremental improvement at best (like LaProp (arxiv.org/abs/2002.04839) or Adam-Atan2
Lucas Nestler
@Clashluke
Aug 17, 2021
I'm excited to announce my latest project: RevLib. RevNets (arxiv.org/abs/1707.04585) are one of the biggest game-changers of recent years, and I hope that this library will help increase their appreciation. Go check it out: github.com/ClashLuke/revl….
Lucas Nestler
@Clashluke
Mar 14, 2025
Don't underestimate this change! Simply swapping LayerNorm with DyT (tanh-based) maintains AdamW convergence levels. Why is this big news? Second-order optimizers perform best on normalization-free architectures - which is precisely what DyT enables x.com/liuzhuang1234/…
Zhuang Liu
@liuzhuang1234
Mar 14, 2025
New paper - Transformers, but without normalization layers (1/n)
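To make the LayerNorm-to-DyT swap mentioned above concrete, here is a minimal sketch in plain Python. It assumes the commonly cited form DyT(x) = gamma * tanh(alpha * x) + beta, with a scalar `alpha` and per-feature `gamma` and `beta`. The default `alpha=0.5` and the function names are illustrative assumptions, not taken from the posts above, and in a real model all three parameters would be learned.

```python
import math


def layer_norm(x, eps=1e-5):
    """Reference LayerNorm over one feature vector (affine terms omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def dyt(x, alpha=0.5, gamma=None, beta=None):
    """Dynamic Tanh sketch: gamma * tanh(alpha * x) + beta, elementwise.

    Unlike LayerNorm, no mean or variance is computed across features.
    alpha=0.5 is an assumed starting value for illustration.
    """
    n = len(x)
    gamma = [1.0] * n if gamma is None else gamma
    beta = [0.0] * n if beta is None else beta
    return [g * math.tanh(alpha * v) + b for v, g, b in zip(x, gamma, beta)]


features = [-4.0, -0.5, 0.0, 0.5, 4.0]
normed = layer_norm(features)   # depends on statistics of the whole vector
squashed = dyt(features)        # depends only on each element itself
```

The design point is that `dyt` needs no reduction over the feature dimension: each output depends only on its own input, while large activations are still squashed into a bounded range.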
Lucas Nestler
@Clashluke
Jan 23, 2025
Wake up babe, new MoE scaling laws dropped
Lucas Nestler
@Clashluke
Feb 2, 2025
If you want to get into GPU programming, learn CUDA. Many influencers are trying to sell you on guides, courses, groups, and more (Triton?). Don't fall for the simplicity trap. @nvidia has got you covered. They want you (need you?) to write good kernels. NVIDIA's "CUDA C++
Lucas Nestler
@Clashluke
Feb 4, 2025
A new paper shows that using L2 distance is better than using a dot product for classification loss.
* More expressive
* Higher stability
* Easier to learn
L2 distances have been a long-standing problem. They are frequently used in contrastive learning (Barlow Twins) but
David D. Baek
@dbaek__
Feb 4, 2025
1/9 🚨 New Paper Alert: Cross-Entropy Loss is NOT What You Need! 🚨 We introduce harmonic loss as an alternative to the standard CE loss for training neural networks and LLMs! Harmonic loss achieves 🛠️significantly better interpretability, ⚡faster convergence, and ⏳less grokking!
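The contrast between a dot-product head and an L2-distance head can be sketched as follows. This is an illustrative reading of the idea, not the paper's reference code: it assumes class probabilities proportional to the inverse L2 distance to a per-class weight vector, raised to a power `n`. The function names, the exponent value, and the toy class centers are all hypothetical.

```python
import math


def dot_product_probs(x, class_weights):
    """Softmax over dot-product logits, as in a standard cross-entropy head."""
    logits = [sum(w_k * x_k for w_k, x_k in zip(w, x)) for w in class_weights]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def l2_distance_probs(x, class_weights, n=2.0, eps=1e-12):
    """Probabilities proportional to 1 / distance**n (assumed form)."""
    dists = [
        math.sqrt(sum((w_k - x_k) ** 2 for w_k, x_k in zip(w, x)))
        for w in class_weights
    ]
    inv = [1.0 / (d + eps) ** n for d in dists]
    total = sum(inv)
    return [v / total for v in inv]


def nll(probs, target):
    """Negative log-likelihood of the target class."""
    return -math.log(probs[target])


centers = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # toy class weight vectors
sample = [0.9, 0.1]                              # lies close to class 0
p_dot = dot_product_probs(sample, centers)
p_l2 = l2_distance_probs(sample, centers)
```

With the distance-based head, each weight vector acts as a class center that the sample is pulled toward, whereas dot-product logits can keep growing simply by scaling the weights.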
Lucas Nestler
@Clashluke
Aug 6, 2024
Article
Grokking Grokfast
Grokfast (https://arxiv.org/abs/2405.20233) is a new, trendy optimizer used by @nisten and others to accelerate the training of language models by hitting the "grokking" regions faster than any other optimizer...
Lucas Nestler
@Clashluke
Sep 16, 2021
Replying to @Clashluke
I'm excited to announce that RevLib now supports Parameter Offload. With the latest release (1.1.0), you can now train infinitely large models on swap memory! Below is a small example that shows how you now use only 8 KiB instead of 1 GiB to run a 256 million parameter model:
Lucas Nestler
@Clashluke
Oct 8, 2025
TRM is one of the best papers I've read in the past years - it truly shows the unfiltered process of a researcher:
1) See awesome paper, get hyped about it
2) Read it - looks cool
3) Run it - doesn't work
4) Find glaring mistakes
5) Fix the issues
Alexia Jolicoeur-Martineau
@jm_alexia
Oct 7, 2025
New paper 📜: Tiny Recursion Model (TRM) is a recursive reasoning approach with a tiny 7M parameters neural network that obtains 45% on ARC-AGI-1 and 8% on ARC-AGI-2, beating most LLMs. Blog: alexiajm.github.io/2025/09/29/tin… Code: github.com/SamsungSAILMon… Paper: arxiv.org/abs/2510.04871
Lucas Nestler
@Clashluke
Sep 28, 2025
kinda crazy everyone is missing out on francesco's research x.com/FrancescoSacco…
Francesco Sacco
@FrancescoSacco1
Sep 10, 2025
Another week, another mini-research project out. This one is about doing first-principles off-policy RL by treating Q-values as probability distributions
Lucas Nestler
@Clashluke
Feb 21, 2025
A new paper (from @cartesia) distills LLaMa into a state space model with interesting results
Lucas Nestler
@Clashluke
May 21, 2022
This evening (6-8PM UTC), I'll present 𝚃-𝙵𝚎𝚠, a novel Encoder-Decoder training recipe that outperforms GPT-3 with as few as 20 examples. Moreover, their trained model costs less than 0.1% as much at inference as the less accurate GPT-style models. Could this be the end of GPT?
Lucas Nestler
@Clashluke
Nov 20, 2022
Over the past weeks, I've worked on validating @ID_AA_Carmack's hypothesis on how to improve Adam's second-order approximation (x.com/ID_AA_Carmack/…). As a result, I'd like to present TGAdam, an optimizer with up to 50% lower relative error: x.com/_clashluke/sta… 1/11
Lucas Nestler
@Clashluke
Nov 20, 2022
Replying to @Clashluke @ID_AA_Carmack and 2 others
..and that's precisely what happened. Without further tuning, Adam#TGAdam (bottom chunk) outperforms both Adam and TGAdam. (At the cost of one more buffer.) Additionally, when tuned sparsely, you see another 10% to 200% reduction in the relative error rate.🤯
Lucas Nestler
@Clashluke
Mar 10, 2025
.@dvruette might've just solved discrete diffusion (-> Diffusion LMs). Instead of modelling tokens and randomly unmasking them, he proposes a new diffusion framework: GIDD. GIDD models discrete data in a continuous space. Read his thread for more: x.com/dvruette/statu…
Dimitri von Rütte
@dvruette
Mar 10, 2025
🚨 NEW PAPER DROP! Wouldn't it be nice if LLMs could spot and correct their own mistakes? And what if we could do so directly from pre-training, without any SFT or RL? We present a new class of discrete diffusion models, called GIDD, that are able to do just that: 🧵1/12