Distributed Shampoo is a preconditioned stochastic gradient optimizer in the adaptive gradient (Adagrad) family of methods [1, 2]. It converges faster by leveraging neural network-specific structures to achieve comparable model quality/accuracy in fewer iterations or epochs at the cost of additional FLOPs and memory, or achieve higher model quality in the same number of iterations or epochs. Our implementation offers specialized support for serial, Distributed Data Parallel (DDP), Fully Sharded Data Parallel (FSDP), Hybrid Sharding Data Parallel (HSDP), Per-parameter Fully Sharded Data Parallel (FSDP2), and Per-parameter Hybrid Sharded Data Parallel (HSDP2) training.
Distributed Shampoo currently only supports dense parameters.
The key to tuning this optimizer is to balance accuracy, performance, and memory. This is discussed in the Step-by-Step Guide below.
Developers:
- Hao-Jun Michael Shi (Meta Platforms, Inc.)
- Tsung-Hsien Lee
- Anna Cai (Meta Platforms, Inc.)
- Runa Eschenhagen (University of Cambridge)
- Shintaro Iwasaki (Meta Platforms, Inc.)
- Ke Sang (Meta Platforms, Inc.)
- Wang Zhou (Meta Platforms, Inc.)
- Iris Zhang (Meta Platforms, Inc.)
with contributions and support from:
Ganesh Ajjanagadde (Meta), Rohan Anil (Google), Adnan Aziz (Meta), Pavan Balaji (Meta), Shuo Chang (Meta), Weiwei Chu (Meta), Assaf Eisenman (Meta), Will Feng (Meta), Zhuobo Feng (Meta), Jose Gallego-Posada (Mila / Meta Platforms, Inc.), Avirup Ghosh (Meta), Yizi Gu (Meta), Vineet Gupta (Google), Yuchen Hao (Meta), Brian Hirsh (Meta), Yusuo Hu (Meta), Yuxi Hu (Meta), Minhui Huang (Meta), Guna Lakshminarayanan (Meta), Michael Lazos (Meta), Zhijing Li (Meta), Ming Liang (Meta), Wanchao Liang (Meta), Ying Liu (Meta), Wenguang Mao (Meta), Dheevatsa Mudigere (NVIDIA), Maxim Naumov (Meta), Jongsoo Park (Meta), Mike Rabbat (Meta), Kaushik Rangadurai (Meta), Dennis van der Staay (Meta), Fei Tian (Meta), Rohan Varma (Meta), Sanjay Vishwakarma (Meta), Xunnan (Shawn) Xu (Meta), Jiyan Yang (Meta), Chunxing Yin (Meta), Gavin Zhang (Meta), Haoran Zhang (Meta), Haoyu Zhang (Meta), Chuanhao Zhuge (Meta), and Will Zou (Meta).
Shampoo won the MLCommons AlgoPerf: Training Algorithms Benchmark Competition! 🥇
In the external tuning ruleset, four submissions beat the challenging prize-qualification baseline, improving over the previous state-of-the-art training algorithm. The "Distributed Shampoo" submission delivered an impressive 28% faster model training than the baseline, establishing it as a leading optimizer in the field.
This achievement has been recognized by major AI organizations.
Key distinctives of this implementation include:
- Homogeneous multi-node multi-GPU support in PyTorch.
- Learning rate grafting [3]. Our version of grafting only grafts the second moment/diagonal preconditioner; momentum/first-moment updates are performed separately from grafting.
- Supports both normal and AdamW (decoupled) weight decay.
- Incorporates exponential moving averaging (with or without bias correction) to estimate the first moment (akin to Adam).
- Incorporates iterate averaging methods (Generalized Primal Averaging and Schedule-Free) that provide momentum-equivalent behavior with improved theoretical properties [13,14].
- Offers multiple approaches for computing the root inverse, including:
  - Symmetric eigendecomposition (used by default).
  - The QR algorithm, to compute an approximate eigendecomposition.
  - Coupled inverse Newton iteration [4].
  - Higher-order coupled iterations with a relative epsilon based on an estimate of the largest eigenvalue.
- Choice of precision for preconditioner accumulation and root inverse computation.
- Ability to cache split parameters.
- Merging of small dimensions.
- Option to (approximately) correct the eigenvalues/run Adam in the eigenbasis of Shampoo's preconditioner (SOAP) [2,6,7].
- Option to use an adaptive preconditioner update frequency when symmetric eigendecomposition or the QR algorithm is used [8].
- Spectral descent via reduced SVD or Newton-Schulz iteration for 2D gradients, or for gradients that have been reshaped to 2D [9,10]. This can be used to implement Muon [11]; see Example 6.
- KL-Shampoo (without per-factor matrix eigenvalue correction) [12].
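As a scalar illustration of the coupled inverse Newton idea [4] used for the root inverse, the sketch below computes $a^{-1/p}$ in plain Python. The matrix version replaces scalars with matrices and the invariant $m_k = a\,x_k^p$ with $M_k = A X_k^p$; the function name and structure here are illustrative only, and the library's implementation differs in details such as scaling, stopping criteria, and epsilon regularization.

```python
def coupled_inverse_newton(a: float, p: int, iters: int = 50) -> float:
    """Scalar sketch of the coupled inverse Newton iteration for a**(-1/p).

    Maintains the invariant m == a * x**p; m converges to 1, so x
    converges to a**(-1/p).
    """
    c = max(1.0, a)          # scale so that m0 = a / c**p lies in (0, 1]
    x, m = 1.0 / c, a / c**p
    for _ in range(iters):
        t = (p + 1 - m) / p  # Newton correction factor
        x *= t
        m *= t**p
    return x

# Example: 16**(-1/2) = 0.25 and 8**(-1/3) = 0.5
print(coupled_inverse_newton(16.0, 2))
print(coupled_inverse_newton(8.0, 3))
```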
We have tested this implementation with the following versions of PyTorch, Python, and CUDA:
- PyTorch >= 2.8;
- Python >= 3.12;
- CUDA 11.3-11.4, 12.2+.
Note: We have observed known instabilities with the `torch.linalg.eigh` operator on CUDA 11.6-12.1, specifically for low-rank matrices, which may appear when using a small `start_preconditioning_step`. Please avoid these versions of CUDA if possible. See: pytorch/pytorch#94772.
Given a learning rate schedule for your previous base optimizer, we can replace the optimizer with Shampoo and "graft" from the learning rate schedule of the base method. Alternatively, you can consider replacing Adam(W) by eigenvalue-corrected Shampoo (SOAP).
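In the spirit of learning-rate grafting [3], the grafted method supplies the update magnitude while Shampoo supplies the update direction. The minimal sketch below shows this norm-grafting idea on a single parameter block; `graft` is a hypothetical helper for illustration, and, as noted above, the actual implementation grafts via the second-moment/diagonal preconditioner rather than by literally rescaling norms this way.

```python
def graft(shampoo_update: list[float], grafted_update: list[float]) -> list[float]:
    """Return the Shampoo direction rescaled to the grafted method's update norm."""
    norm = lambda v: sum(x * x for x in v) ** 0.5
    scale = norm(grafted_update) / norm(shampoo_update)
    return [x * scale for x in shampoo_update]

# Shampoo direction [3, 4] (norm 5) takes on the norm of an SGD update [0, 2] (norm 2).
update = graft([3.0, 4.0], [0.0, 2.0])
print(update)
```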
A few notes on hyperparameters:

- Notice that Shampoo contains some new hyperparameters (`max_preconditioner_dim` and `precondition_frequency`) that are important for performance. We describe how to tune these below in the section on Hyperparameter Tuning.
- Here, `betas` refers to the hyperparameters used for the exponential moving average of the gradients and Shampoo preconditioners, while `grafting_beta2` corresponds to the `beta2` used specifically for the exponential moving averaging of the grafted method. The same relationship holds for `epsilon` and `grafting_epsilon`. As a first choice, we recommend setting `betas` equal to the previous `betas`, additionally setting `grafting_beta2` equal to `betas[1]`, and setting `epsilon = 1e-12` and `grafting_epsilon` equal to the previous `epsilon`.
- We also distinguish between `beta1` and iterate averaging. `beta1` (via `betas[0]`) corresponds to the EMA of the gradients (or gradient filtering), while iterate averaging (via `iterate_averaging_config`) provides momentum-like behavior through primal averaging. See Example 7 for details on configuring iterate averaging to achieve SGD momentum equivalence.
- We allow for decoupled and coupled weight decay. Setting `use_decoupled_weight_decay=True` enables AdamW-style weight decay, while `use_decoupled_weight_decay=False` corresponds to the normal L2-regularization-style weight decay.
- When setting `preconditioner_config` as an instance of `EigenvalueCorrectedShampooPreconditionerConfig` (see Example 5), there is typically no need to use learning rate grafting from Adam (`grafting_config=None`), and, when they are available, Adam's optimal `lr`, `betas`, and `weight_decay` should be a good starting point for further tuning. However, the case of `beta2=1.0`, i.e., an AdaGrad-like accumulation, has not been explored yet. Also, in settings where Shampoo would usually graft its learning rate from SGD, grafting might still be beneficial.
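The recommended mapping above can be summarized as a small helper. This is only a conceptual sketch, not a library API: `shampoo_kwargs_from_adam` is a hypothetical name, and in actual code the `"grafting"` entry corresponds to constructing `grafting_config=AdamPreconditionerConfig(beta2=..., epsilon=...)` as in the examples below.

```python
def shampoo_kwargs_from_adam(lr, betas, eps, weight_decay):
    """Map existing Adam(W) hyperparameters onto the recommended Shampoo settings."""
    return {
        "lr": lr,                    # reuse the tuned learning rate and schedule
        "betas": betas,              # keep the previous betas
        "epsilon": 1e-12,            # recommended Shampoo epsilon
        "weight_decay": weight_decay,
        # Graft from Adam using the previous beta2 and epsilon.
        "grafting": {"beta2": betas[1], "epsilon": eps},
    }

print(shampoo_kwargs_from_adam(1e-3, (0.9, 0.999), 1e-8, 1e-5))
```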
Example 1: SGD with Momentum
If we previously used the optimizer:
import torch
from torch.optim import SGD
model = instantiate_model()
optimizer = SGD(
model.parameters(),
lr=0.01,
momentum=0.9,
weight_decay=1e-05,
)

we would instead use:
import torch
from distributed_shampoo import (
DistributedShampoo,
GeneralizedPrimalAveragingConfig,
SGDPreconditionerConfig,
)
model = instantiate_model()
optimizer = DistributedShampoo(
model.parameters(),
lr=0.1, # = 0.01 / (1 - 0.9) to account for primal averaging formulation
betas=(0., 0.999),
epsilon=1e-12,
weight_decay=1e-05,
max_preconditioner_dim=8192,
precondition_frequency=100,
grafting_config=SGDPreconditionerConfig(),
iterate_averaging_config=GeneralizedPrimalAveragingConfig(
eval_interp_coeff=0.9, # = momentum
train_interp_coeff=1.0, # 1.0 for heavy-ball momentum
),
)

Example 2: Adam
If we previously used the optimizer:
import torch
from torch.optim import Adam
model = instantiate_model()
optimizer = Adam(
model.parameters(),
lr=0.001,
betas=(0.9, 0.999),
eps=1e-08,
weight_decay=1e-05,
)

we would instead use:
import torch
from distributed_shampoo import AdamPreconditionerConfig, DistributedShampoo
model = instantiate_model()
optimizer = DistributedShampoo(
model.parameters(),
lr=0.001,
betas=(0.9, 0.999),
epsilon=1e-12,
weight_decay=1e-05,
max_preconditioner_dim=8192,
precondition_frequency=100,
use_decoupled_weight_decay=False,
grafting_config=AdamPreconditionerConfig(
beta2=0.999,
epsilon=1e-08,
),
)

Example 3: Adagrad
If we previously used the optimizer:
import torch
from torch.optim import Adagrad
model = instantiate_model()
optimizer = Adagrad(
model.parameters(),
lr=0.01,
eps=1e-10,
weight_decay=1e-05,
)

we would instead use:
import torch
from distributed_shampoo import AdaGradPreconditionerConfig, DistributedShampoo
model = instantiate_model()
optimizer = DistributedShampoo(
model.parameters(),
lr=0.01,
betas=(0., 1.0),
epsilon=1e-12,
weight_decay=1e-05,
max_preconditioner_dim=8192,
precondition_frequency=100,
use_decoupled_weight_decay=False,
grafting_config=AdaGradPreconditionerConfig(
epsilon=1e-10,
),
)

Example 4: AdamW
If we previously used the optimizer:
import torch
from torch.optim import AdamW
model = instantiate_model()
optimizer = AdamW(
model.parameters(),
lr=0.001,
betas=(0.9, 0.999),
eps=1e-08,
weight_decay=1e-05,
)

we would instead use:
import torch
from distributed_shampoo import AdamPreconditionerConfig, DistributedShampoo
model = instantiate_model()
optimizer = DistributedShampoo(
model.parameters(),
lr=0.001,
betas=(0.9, 0.999),
epsilon=1e-12,
weight_decay=1e-05,
max_preconditioner_dim=8192,
precondition_frequency=100,
use_decoupled_weight_decay=True,
grafting_config=AdamPreconditionerConfig(
beta2=0.999,
epsilon=1e-08,
),
)

Example 5: Eigenvalue-Corrected Shampoo (SOAP)

If we previously used the optimizer:
import torch
from torch.optim import AdamW
model = instantiate_model()
optimizer = AdamW(
model.parameters(),
lr=0.001,
betas=(0.9, 0.999),
eps=1e-08,
weight_decay=1e-05,
)

we would instead use:
import torch
from distributed_shampoo import (
DistributedShampoo,
DefaultEigenvalueCorrectedShampooConfig,
)
model = instantiate_model()
optimizer = DistributedShampoo(
model.parameters(),
lr=0.001,
betas=(0.9, 0.999),
epsilon=1e-12,
weight_decay=1e-05,
max_preconditioner_dim=8192,
precondition_frequency=100,
use_decoupled_weight_decay=True,
# This can also be set to `DefaultSOAPConfig` which uses QR decompositions, hence is
# less expensive and might thereby allow for a smaller `precondition_frequency`.
preconditioner_config=DefaultEigenvalueCorrectedShampooConfig,
)

Example 6: Muon

import math
from distributed_shampoo import (
AdamPreconditionerConfig,
DistributedShampoo,
NewtonSchulzOrthogonalizationConfig,
SingleDeviceDistributedConfig,
SpectralDescentPreconditionerConfig,
)
model = instantiate_model()
# Separate parameters into hidden layers (only 2D) and other parameters (first layer, biases and other 1D parameters, and last layer).
hidden_layer_params = ...
other_params = ...
optimizer = DistributedShampoo(
[
# Use spectral descent with Newton-Schulz semi-orthogonalization for hidden layer parameters.
{
"params": hidden_layer_params,
"lr": 0.02,
"preconditioner_config": SpectralDescentPreconditionerConfig(
orthogonalization_config=NewtonSchulzOrthogonalizationConfig(
scale_by_dims_fn=lambda d_in, d_out: max(1, d_out / d_in)**0.5,
),
),
# The two settings below guarantee that the >2D parameters are reshaped to 2D by flattening all but the first dimension (after squeezing dimensions of size 1).
"max_preconditioner_dim": math.inf,
"distributed_config": SingleDeviceDistributedConfig(
target_parameter_dimensionality=2,
),
},
# Use AdamW for other parameters.
{
"params": other_params,
"lr": 3e-4,
"start_preconditioning_step": math.inf,
"grafting_config": AdamPreconditionerConfig(
beta2=0.95,
epsilon=1e-10,
),
},
],
weight_decay=1e-05,
use_decoupled_weight_decay=True,
)

`SpectralDescentPreconditionerConfig` can also be used to implement other variations of spectral descent.
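To make the semi-orthogonalization step concrete, the sketch below runs the classical (cubic) Newton-Schulz iteration on a small matrix in plain Python. It drives all singular values toward 1, i.e., it converges to the semi-orthogonal polar factor of the input. Treat it only as an illustration of the principle: the library's `NewtonSchulzOrthogonalizationConfig` uses a tuned polynomial variant, and the helper names here are ours.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def newton_schulz(X, steps=30):
    """Cubic Newton-Schulz iteration X <- 1.5*X - 0.5*(X X^T)X.

    Converges to the semi-orthogonal polar factor U V^T of X when the
    singular values of the starting matrix lie in (0, sqrt(3)).
    """
    # Normalize so the spectral norm is <= 1 (the Frobenius norm upper-bounds it).
    fro = sum(v * v for row in X for v in row) ** 0.5
    X = [[v / fro for v in row] for row in X]
    for _ in range(steps):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)] for rx, ry in zip(X, XXtX)]
    return X

Q = newton_schulz([[3.0, 1.0], [0.0, 2.0]])
QQt = matmul(Q, transpose(Q))  # approximately the 2x2 identity matrix
```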
Example 7: Iterate Averaging (GPA and Schedule-Free)

Distributed Shampoo supports iterate averaging methods, including Generalized Primal Averaging (GPA) and Schedule-Free optimization. These methods provide an alternative to traditional momentum with improved theoretical properties.
GPA maintains two sequences of iterates and can reproduce momentum behavior:
from distributed_shampoo import (
DistributedShampoo,
GeneralizedPrimalAveragingConfig,
SGDPreconditionerConfig,
)
model = instantiate_model()
# Example: Shampoo with GPA equivalent to SGD momentum=0.9
optimizer = DistributedShampoo(
model.parameters(),
lr=0.1, # = original_lr / (1 - momentum) = 0.01 / (1 - 0.9)
betas=(0.0, 1.0),
epsilon=1e-12,
max_preconditioner_dim=8192,
precondition_frequency=100,
preconditioner_config=SGDPreconditionerConfig(),
iterate_averaging_config=GeneralizedPrimalAveragingConfig(
eval_interp_coeff=0.9, # = momentum
train_interp_coeff=1.0, # 1.0 for heavy-ball, momentum for Nesterov
),
)

Schedule-Free eliminates the need for learning rate schedules:
from distributed_shampoo import (
DistributedShampoo,
ScheduleFreeConfig,
AdamPreconditionerConfig,
)
model = instantiate_model()
optimizer = DistributedShampoo(
model.parameters(),
lr=0.001,
betas=(0.9, 0.999),
epsilon=1e-12,
max_preconditioner_dim=8192,
precondition_frequency=100,
grafting_config=AdamPreconditionerConfig(
beta2=0.999,
epsilon=1e-08,
),
iterate_averaging_config=ScheduleFreeConfig(
train_interp_coeff=0.9,
),
)
# Important: Call train() before training and eval() before evaluation
optimizer.train()
# ... training loop ...
optimizer.eval()
# ... evaluation ...

The previous `momentum`, `dampening`, and `use_nesterov` parameters have been replaced by iterate averaging configs. Here are the equivalences:
| Previous Configuration | New Configuration |
|---|---|
| `momentum=β, dampening=0, use_nesterov=False` | `GeneralizedPrimalAveragingConfig(eval_interp_coeff=β, train_interp_coeff=1.0)` with `lr = lr / (1 - β)` |
| `momentum=β, dampening=0, use_nesterov=True` | `GeneralizedPrimalAveragingConfig(eval_interp_coeff=β, train_interp_coeff=β)` with `lr = lr / (1 - β)` |
| `momentum=β, dampening=d (d≠0)` | No direct equivalent; dampening is not supported in iterate averaging. |
Note on LaProp: The previous momentum implementation (sometimes called LaProp) is mathematically equivalent to the heavy-ball/primal averaging formulation when dampening=0. Use the heavy-ball configuration above.
Note on LaPropW: The original LaPropW from the paper includes additional weight decay handling that may differ slightly from the iterate averaging formulation. For most practical purposes, the heavy-ball configuration provides equivalent behavior.
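The heavy-ball row of this correspondence can be checked numerically. The sketch below runs PyTorch-style SGD with momentum and a constant-coefficient primal-averaging recursion side by side on a 1-D quadratic; the iterates coincide when the primal-averaging step size is rescaled by `1 / (1 - β)`, matching the `lr` adjustment in the table. This is a plain-Python illustration of the equivalence, not the library's exact update.

```python
def heavy_ball(grad, x0, lr, beta, steps):
    """PyTorch-style SGD with momentum (dampening=0)."""
    x, buf = x0, 0.0
    xs = []
    for _ in range(steps):
        buf = beta * buf + grad(x)  # momentum buffer
        x = x - lr * buf
        xs.append(x)
    return xs

def primal_averaging(grad, x0, lr, beta, steps):
    """Constant-coefficient primal averaging with step size lr / (1 - beta)."""
    gamma = lr / (1.0 - beta)
    x = z = x0
    xs = []
    for _ in range(steps):
        z = z - gamma * grad(x)          # averaged ("z") sequence
        x = beta * x + (1.0 - beta) * z  # interpolated iterate used for gradients
        xs.append(x)
    return xs

grad = lambda x: 2.0 * x  # gradient of f(x) = x**2
a = heavy_ball(grad, x0=1.0, lr=0.01, beta=0.9, steps=50)
b = primal_averaging(grad, x0=1.0, lr=0.01, beta=0.9, steps=50)
```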
Our implementation offers specialized compatibility and performance optimizations for different distributed training paradigms, including Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (including FSDP and per-parameter FSDP, a.k.a. FSDP2) training. Note that Distributed Shampoo will work out of the box for DDP training, but not for FSDP training.
In order to support fast DDP training, our implementation offers ZeRO-1 support, which distributes the computation and memory (via DTensor) in order to lower both Shampoo's memory requirements and its per-iteration wall-clock time at the cost of additional (AllGather) communication. Our DDP Shampoo implementation can either: (1) communicate the updated parameters; or (2) communicate the parameter updates.
We support:
- Quantized (or low-precision) communications using BF16, FP16, or FP32 communications.
- Specification of the number of trainers within each process group to distribute compute and memory. This trades off the amount of communication and compute each trainer is responsible for.
- Option to communicate updated parameters.
To use DDP Shampoo, simply configure the distributed_config as DDPDistributedConfig:
import os
import torch
import torch.distributed as dist
from distributed_shampoo import (
AdamPreconditionerConfig,
DDPDistributedConfig,
DistributedShampoo,
)
from torch import nn
LOCAL_RANK = int(os.environ["LOCAL_RANK"])
WORLD_RANK = int(os.environ["RANK"])
WORLD_SIZE = int(os.environ["WORLD_SIZE"])
dist.init_process_group(
backend=args.backend,
init_method="env://",
rank=WORLD_RANK,
world_size=WORLD_SIZE,
)
device = torch.device("cuda:{}".format(LOCAL_RANK))
torch.cuda.set_device(LOCAL_RANK)
model = instantiate_model().to(device)
model = nn.parallel.DistributedDataParallel(
model, device_ids=[LOCAL_RANK], output_device=LOCAL_RANK
)
optimizer = DistributedShampoo(
model.parameters(),
lr=0.001,
betas=(0.9, 0.999),
epsilon=1e-12,
weight_decay=1e-05,
max_preconditioner_dim=8192,
precondition_frequency=100,
use_decoupled_weight_decay=True,
grafting_config=AdamPreconditionerConfig(
beta2=0.999,
epsilon=1e-12,
),
distributed_config=DDPDistributedConfig(
communication_dtype=torch.float32,
num_trainers_per_group=8,
communicate_params=False,
),
)

Please see ddp_cifar10_example.py as an example.
FSDP training will create flattened parameters by flattening and concatenating all parameters within each FSDP module. By default, this removes all information about each parameter's tensor shape that Shampoo aims to exploit. Therefore, in order to support FSDP training, we have to use additional FSDP metadata in order to recover valid tensor blocks of the original parameters.
Note that we only support PyTorch FSDP with the use_orig_params=True option.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from distributed_shampoo import (
AdamPreconditionerConfig,
compile_fsdp_parameter_metadata,
DistributedShampoo,
FSDPDistributedConfig,
)
LOCAL_RANK = int(os.environ["LOCAL_RANK"])
WORLD_RANK = int(os.environ["RANK"])
WORLD_SIZE = int(os.environ["WORLD_SIZE"])
dist.init_process_group(
backend=args.backend,
init_method="env://",
rank=WORLD_RANK,
world_size=WORLD_SIZE,
)
device = torch.device("cuda:{}".format(LOCAL_RANK))
model = instantiate_model().to(device)
model = FSDP(model, use_orig_params=True)
optimizer = DistributedShampoo(
model.parameters(),
lr=0.001,
betas=(0.9, 0.999),
epsilon=1e-12,
weight_decay=1e-05,
max_preconditioner_dim=8192,
precondition_frequency=100,
use_decoupled_weight_decay=True,
grafting_config=AdamPreconditionerConfig(
beta2=0.999,
epsilon=1e-12,
),
distributed_config=FSDPDistributedConfig(
param_to_metadata=compile_fsdp_parameter_metadata(model),
),
)

Please see fsdp_cifar10_example.py as an example.
Note that we only support PyTorch HSDP with sharding_strategy=ShardingStrategy.HYBRID_SHARD and the use_orig_params=True option.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy
from torch.distributed.device_mesh import init_device_mesh
from distributed_shampoo import (
AdamPreconditionerConfig,
compile_fsdp_parameter_metadata,
DistributedShampoo,
HSDPDistributedConfig,
)
LOCAL_RANK = int(os.environ["LOCAL_RANK"])
WORLD_RANK = int(os.environ["RANK"])
WORLD_SIZE = int(os.environ["WORLD_SIZE"])
dist.init_process_group(
backend=args.backend,
init_method="env://",
rank=WORLD_RANK,
world_size=WORLD_SIZE,
)
device = torch.device("cuda:{}".format(LOCAL_RANK))
# Instantiate device mesh for HSDP Shampoo.
# Assuming 8 GPUs, a 2 x 4 mesh will be initialized.
# This means we shard the model into four shards, and each shard has two replicas.
# [0, 1, 2, 3] and [4, 5, 6, 7] are the two shard groups.
# [0, 4], [1, 5], [2, 6], [3, 7] are the four replicate groups.
device_mesh = init_device_mesh("cuda", (2, 4))
model = instantiate_model().to(device)
model = FSDP(model, device_mesh=device_mesh, sharding_strategy=ShardingStrategy.HYBRID_SHARD, use_orig_params=True)
optimizer = DistributedShampoo(
model.parameters(),
lr=0.001,
betas=(0.9, 0.999),
epsilon=1e-12,
weight_decay=1e-05,
max_preconditioner_dim=8192,
precondition_frequency=100,
use_decoupled_weight_decay=True,
grafting_config=AdamPreconditionerConfig(
beta2=0.999,
epsilon=1e-12,
),
distributed_config=HSDPDistributedConfig(
param_to_metadata=compile_fsdp_parameter_metadata(model),
device_mesh=device_mesh,
),
)

Please see hsdp_cifar10_example.py as an example.
Per-parameter-sharding FSDP, also known as FSDP2, is the new fully sharded data parallelism implementation. It uses DTensor-based dim-0 per-parameter sharding for a simpler sharding representation than FSDP1's flat-parameter sharding, while preserving similar throughput. In short, FSDP2 chunks each parameter on dim-0 across the data-parallel workers (using torch.chunk(dim=0)). To support Shampoo with FSDP2, we implement a new distributor that creates Shampoo preconditioner tensor blocks based on the rank-local tensors of the dim-0-sharded DTensor parameters. One simplification FSDP2 brings to Shampoo is that tensor blocks are local to each rank, so we do not need the tensor block recovery algorithm implemented for FSDP1 (where parameters are flattened and then sharded).
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard
from distributed_shampoo import (
AdamPreconditionerConfig,
DistributedShampoo,
FullyShardDistributedConfig,
)
LOCAL_RANK = int(os.environ["LOCAL_RANK"])
WORLD_RANK = int(os.environ["RANK"])
WORLD_SIZE = int(os.environ["WORLD_SIZE"])
dist.init_process_group(
backend=args.backend,
init_method="env://",
rank=WORLD_RANK,
world_size=WORLD_SIZE,
)
device = torch.device("cuda:{}".format(LOCAL_RANK))
model = instantiate_model().to(device)
model = fully_shard(model)
optimizer = DistributedShampoo(
model.parameters(),
lr=0.001,
betas=(0.9, 0.999),
epsilon=1e-12,
weight_decay=1e-05,
max_preconditioner_dim=8192,
precondition_frequency=100,
use_decoupled_weight_decay=True,
grafting_config=AdamPreconditionerConfig(
beta2=0.999,
epsilon=1e-12,
),
distributed_config=FullyShardDistributedConfig(),
)

Please see fully_shard_cifar10_example.py as an example.
We support PyTorch HSDP for FSDP2 (fully_shard).
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard
from torch.distributed.device_mesh import init_device_mesh
from distributed_shampoo import (
AdamPreconditionerConfig,
DistributedShampoo,
HybridShardDistributedConfig,
)
LOCAL_RANK = int(os.environ["LOCAL_RANK"])
WORLD_RANK = int(os.environ["RANK"])
WORLD_SIZE = int(os.environ["WORLD_SIZE"])
dist.init_process_group(
backend=args.backend,
init_method="env://",
rank=WORLD_RANK,
world_size=WORLD_SIZE,
)
device = torch.device("cuda:{}".format(LOCAL_RANK))
# Instantiate device mesh for HSDP Shampoo.
# Assuming 8 GPUs, a 2 x 4 mesh will be initialized.
# This means we shard the model into four shards, and each shard has two replicas.
# [0, 1, 2, 3] and [4, 5, 6, 7] are the two shard groups.
# [0, 4], [1, 5], [2, 6], [3, 7] are the four replicate groups.
device_mesh = init_device_mesh("cuda", (2, WORLD_SIZE // 2))
model = instantiate_model().to(device)
model = fully_shard(model, mesh=device_mesh)
optimizer = DistributedShampoo(
model.parameters(),
lr=0.001,
betas=(0.9, 0.999),
epsilon=1e-12,
weight_decay=1e-05,
max_preconditioner_dim=8192,
precondition_frequency=100,
use_decoupled_weight_decay=True,
grafting_config=AdamPreconditionerConfig(
beta2=0.999,
epsilon=1e-12,
),
distributed_config=HybridShardDistributedConfig(device_mesh=device_mesh),
)

Please see hybrid_shard_cifar10_example.py as an example.
Distributed Shampoo supports PyTorch standard state dict API via state_dict() and load_state_dict(). For saving and loading checkpoints, it is compatible with both:
- Standard PyTorch serialization: torch.save() / torch.load()
- Distributed checkpointing: dcp.save() / dcp.load()
Given a CHECKPOINT_DIR, to store the checkpoint with PyTorch's torch.distributed.checkpoint:
import torch.distributed.checkpoint as dcp
state_dict = {
"model": model.state_dict(),
"optim": optimizer.state_dict(),
}
dcp.save(
state_dict=state_dict,
checkpoint_id=CHECKPOINT_DIR,
)

To load the checkpoint:
dcp.load(
state_dict=state_dict,
checkpoint_id=CHECKPOINT_DIR,
)
model.load_state_dict(state_dict["model"])
optimizer.load_state_dict(state_dict["optim"])

You can also refer to ddp_cifar10_example.py as an example.
PyTorch Distributed Checkpoint will save your sharded checkpoint in a folder named step-{STEP}. To convert the sharded checkpoints from DCP format to torch.save format (.pt file), you can use the following offline conversion command provided by PyTorch:
python -m torch.distributed.checkpoint.format_utils dcp_to_torch {YOUR_DCP_CHECKPOINT_PATH}/step-{STEP} {YOUR_TORCH_SAVE_FILE_NAME}.pt

For more information, please refer to Distributed Checkpoint - torch.distributed.checkpoint.
We want to tune Shampoo to balance model quality, memory, and efficiency/performance by applying approximations to a "pure" version of Shampoo.
This requires adjusting the hyperparameters `max_preconditioner_dim`, `precondition_frequency`, and `start_preconditioning_step`. The general approach is to start with as close to a "pure" version of Shampoo as possible, then incorporate approximations to ensure fast performance. A pure version of Shampoo would set `max_preconditioner_dim = 8192` and `precondition_frequency = 1`.
With the inclusion of learning rate grafting, we can extract a good learning rate schedule from your existing scheduler. Other techniques for preventing divergence (e.g., gradient clipping) may also be removed.
- Start with a reasonable `max_preconditioner_dim` (i.e., 8192) and reduce the block size as necessary for memory and performance.
  - The maximum effective value of this hyperparameter is the maximum of the products of each layer's dimensions. For example, if we have a model with three layers where the first layer is 5x5x3x6, the second layer is 3x3x3x8, and the third layer is 216x5, then the products of the first, second, and third layers' dimensions are 5x5x3x6=450, 3x3x3x8=216, and 216x5=1080, respectively. In this example, 1080 is the maximum effective value of this hyperparameter, and any value greater than 1080 will perform the same as 1080.
  - The higher this value is, the better the model quality we expect.
  - There is a sweet spot in terms of performance: if the value is too small, the algorithm will slow down due to kernel latency; on the other hand, too large a value leads to slow matrix computations (i.e., matrix root inverses), which scale as $O(n^3)$ where $n$ is the dimension of the matrix, as well as poor load balancing. In our experience, using a `max_preconditioner_dim` between 1024 and 8192 is ideal for performance.
  - Memory varies depending on the order of the tensor. For vectors, increasing `max_preconditioner_dim` leads to increased memory costs, but for 3rd-order tensors (or higher), increasing `max_preconditioner_dim` leads to decreased memory costs. Blocked matrices yield a fixed memory cost regardless of `max_preconditioner_dim`.
  - For efficiency purposes, it is best to set this value as a multiple of 2.
  - The following is an example of setting `max_preconditioner_dim = 4096` with SGD grafting:

        optimizer = DistributedShampoo(
            nn.parameters(),
            lr=0.01,
            betas=(0., 0.999),
            weight_decay=0.01,
            max_preconditioner_dim=4096,
            grafting_config=SGDPreconditionerConfig(),
        )
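The memory behavior described above can be checked with a quick calculation. The sketch below counts factor-matrix entries for a tensor blocked into cubes of side `block_size`; it is a simplification that assumes the block size divides every dimension evenly, whereas the real implementation also merges small dimensions and handles remainders.

```python
def factor_matrix_memory(shape, block_size):
    """Number of factor-matrix entries for a tensor blocked into cubes of side block_size."""
    num_blocks = 1
    for dim in shape:
        assert dim % block_size == 0, "sketch assumes even division into blocks"
        num_blocks *= dim // block_size
    # Each block of shape (b, ..., b) stores one b x b factor matrix per dimension.
    return num_blocks * len(shape) * block_size**2

# Vectors: memory grows with the block size: (n/b) blocks * b^2 = n*b.
# Matrices: memory is constant: (m/b)(n/b) blocks * 2b^2 = 2mn for any b.
# 3rd-order tensors: memory shrinks as b grows: (n/b)^3 blocks * 3b^2 = 3n^3/b.
print(factor_matrix_memory((2048, 1024), 256), factor_matrix_memory((2048, 1024), 512))
```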
- Use the smallest `precondition_frequency` (i.e., 1) and increase the precondition frequency as needed.
  - This hyperparameter determines how frequently the preconditioner is computed. The smaller the value, the slower Shampoo becomes but the faster it converges. The goal is to find a value that balances convergence and speed.
  - It is normal to eventually set this hyperparameter on the order of hundreds or thousands. This is based primarily on the size of the network and the effective ratio between the cost of a single forward-backward pass plus standard optimizer step and the cost of computing a series of matrix root inverses.
  - In practice, we have found that an upper bound to `precondition_frequency` is on the order of thousands. Beyond this, increasing the frequency offers diminishing performance gains if the bottleneck is due to preconditioning, which is performed at every iteration.
  - The following is an example of setting `precondition_frequency = 100`:

        optimizer = DistributedShampoo(
            nn.parameters(),
            lr=0.01,
            betas=(0., 0.999),
            weight_decay=0.01,
            precondition_frequency=100,
            grafting_config=SGDPreconditionerConfig(),
        )
- Set `start_preconditioning_step` to be consistent with the precondition frequency.
  - This hyperparameter determines when to start using Shampoo. Prior to this step, the optimizer will use the grafted method. This value should generally be set larger than or equal to `precondition_frequency`, except when the precondition frequency is 1. By default, `start_preconditioning_step` is set equal to `precondition_frequency`.
  - If `precondition_frequency = 1`, then set `start_preconditioning_step = -1` in order to use Shampoo from the start.
  - The following is an example of setting `start_preconditioning_step = 300`:

        optimizer = DistributedShampoo(
            nn.parameters(),
            lr=0.01,
            betas=(0., 0.999),
            weight_decay=0.01,
            start_preconditioning_step=300,
            grafting_config=SGDPreconditionerConfig(),
        )
- To tune for better model quality, one can tune:
  - Learning Rate (`lr`): One can change the learning rate schedule, and potentially use a larger learning rate.
  - Epsilon Regularization (`epsilon`): One should typically search for a value in $\{10^{-12}, 10^{-11}, \ldots, 10^{-2}, 10^{-1}\}$.
  - Exponential Moving Average Parameters (`betas`): One can tune the `betas = (beta1, beta2)` parameters as is typical for Adam(W).
  - Preconditioner Data Type (`factor_matrix_dtype`): For certain models, it is necessary to use higher precision to accumulate the Shampoo factor matrices and compute their eigendecompositions in order to obtain high enough numerical accuracy. In those cases, one can specify this as `torch.float64`. (Note that this will use more memory.)
  - MTML Task Weights: Task weights may need to be re-tuned, as Distributed Shampoo will better exploit certain imbalances between different task losses.
- If enabling DDP Shampoo, you can tune for performance:
  - Process Group Size (`num_trainers_per_group`): For large-scale distributed jobs, this hyperparameter allows us to trade off computational and communication costs. Assuming the number of GPUs per node is 8, one should search for a value in $\{8, 16, 32, 64\}$. This hyperparameter has no impact on model quality.
  - Quantized Communications (`communication_dtype`): One can enable quantized communications by setting the `communication_dtype`. We have found that using `torch.float16` works well in practice (with `communicate_params = False`).
  - Communicate Updated Parameters (`communicate_params`): If one does not enable quantized communications, one can possibly obtain better performance by communicating the updated parameters, i.e., setting this to `True`.
When gradients contain NaN/Inf values, most optimizers proceed silently and update model weights with those NaN/Inf values, but Shampoo reacts with error messages like "Encountered nan values ...".
When encountering those errors, here are some things you can try:
- Decrease the learning rate.
- Adjust the learning rate scheduler.
- Increase `start_preconditioning_step`.
- Consider applying gradient clipping.
If you use PyTorch Distributed Shampoo in your work, please use the following BibTeX entry.
@misc{shi2023pytorchshampoo,
title={A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale},
author={Hao-Jun Michael Shi and Tsung-Hsien Lee and Shintaro Iwasaki and Jose Gallego-Posada and Zhijing Li and Kaushik Rangadurai and Dheevatsa Mudigere and Michael Rabbat},
howpublished={\url{https://github.com/facebookresearch/optimizers/tree/main/distributed_shampoo}},
year={2023},
eprint={2309.06497},
archivePrefix={arXiv},
primaryClass={cs.LG}
}

1. Shampoo: Preconditioned Stochastic Tensor Optimization. Vineet Gupta, Tomer Koren, and Yoram Singer. ICML, 2018.
2. Scalable Second-Order Optimization for Deep Learning. Rohan Anil, Vineet Gupta, Tomer Koren, Kevin Regan, and Yoram Singer. Tech report, 2021.
3. Learning Rate Grafting: Transferability of Optimizer Tuning. Naman Agarwal, Rohan Anil, Elad Hazan, Tomer Koren, and Cyril Zhang. Tech report, 2021.
4. Functions of Matrices: Theory and Computation. Nicholas J. Higham. SIAM, 2008.
5. A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale. Hao-Jun Michael Shi, Tsung-Hsien Lee, Shintaro Iwasaki, Jose Gallego-Posada, Zhijing Li, Kaushik Rangadurai, Dheevatsa Mudigere, and Michael Rabbat. Tech report, 2023.
6. Fast Approximate Natural Gradient Descent in a Kronecker-factored Eigenbasis. Thomas George, César Laurent, Xavier Bouthillier, Nicolas Ballas, and Pascal Vincent. NeurIPS, 2018.
7. SOAP: Improving and Stabilizing Shampoo using Adam. Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade. ICLR, 2025.
8. Purifying Shampoo: Investigating Shampoo's Heuristics by Decomposing its Preconditioner. Runa Eschenhagen, Aaron Defazio, Tsung-Hsien Lee, Richard E. Turner, and Hao-Jun Michael Shi. NeurIPS, 2025.
9. Preconditioned Spectral Descent for Deep Learning. David E. Carlson, Edo Collins, Ya-Ping Hsieh, Lawrence Carin, and Volkan Cevher. NeurIPS, 2015.
10. Old Optimizer, New Norm: An Anthology. Jeremy Bernstein and Laker Newhouse. Tech report, 2024.
11. Muon: An optimizer for hidden layers in neural networks. Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Blog post, 2024.
12. Understanding and Improving Shampoo and SOAP via Kullback-Leibler Minimization. Wu Lin, Scott C. Lowe, Felix Dangel, Runa Eschenhagen, Zikun Xu, and Roger B. Grosse. Tech report, 2025.
13. The Road Less Scheduled. Aaron Defazio, Xingyu Alice Yang, Harsh Mehta, Konstantin Mishchenko, Ahmed Khaled, and Ashok Cutkosky. NeurIPS, 2025.
14. Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs. Aaron Defazio, Konstantin Mishchenko, Parameswaran Raman, Hao-Jun Michael Shi, and Lin Xiao. Tech report, 2025.