Learning rate schedules seem mysterious?
Turns out that their behaviour can be described with a bound from *convex, nonsmooth* optimization.
Short thread on our latest paper 🚇
arxiv.org/abs/2501.18965
The sudden loss drop when annealing the learning rate at the end of a WSD (warmup-stable-decay) schedule can be explained without relying on non-convexity or even smoothness. A new paper shows that it can be precisely predicted by theory in the convex, non-smooth setting!
1/2
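For readers who haven't met WSD before, here is a minimal sketch of what such a schedule might look like. The function name, its parameters, the linear warmup, and the linear decay to zero are all assumptions made for illustration; the paper may use different shapes for the cooldown.

```python
def wsd_schedule(step, total_steps, peak_lr=1.0, warmup_steps=0, decay_frac=0.2):
    """Learning rate at 0-indexed `step` for a warmup-stable-decay schedule.

    Hypothetical helper: linear warmup to `peak_lr`, a constant plateau,
    then linear decay over the last `decay_frac` of training (assumed shapes).
    """
    # First step of the decay phase.
    decay_start = int(round(total_steps * (1 - decay_frac)))
    if step < warmup_steps:
        # Warmup: ramp linearly up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Decay: anneal linearly, reaching zero at step == total_steps.
    return peak_lr * (total_steps - step) / (total_steps - decay_start)


if __name__ == "__main__":
    lrs = [wsd_schedule(t, total_steps=100, warmup_steps=10) for t in range(100)]
    print(lrs[0], lrs[50], lrs[90], lrs[99])
```

The loss drop mentioned above happens during the final phase, where the learning rate is annealed.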



