This repository contains the official implementation for the paper: Controlled LLM Training on Spectral Sphere.
Abstract: Scaling large models requires optimization strategies that ensure rapid convergence grounded in stability. Maximal Update Parametrization (μP) provides a theoretical safeguard for width-invariant Θ(1) activation control, whereas emerging optimizers like Muon are only "half-aligned" with these constraints: they control updates but allow weights to drift. To address this limitation, we introduce the Spectral Sphere Optimizer (SSO), which enforces strict module-wise spectral constraints on both weights and their updates. By deriving the steepest descent direction on the spectral sphere, SSO realizes a fully μP-aligned optimization process. To enable large-scale training, we implement SSO as an efficient parallel algorithm within Megatron. Through extensive pretraining on diverse architectures, including Dense 1.7B, MoE 8B-A1B, and 200-layer DeepNet models, SSO consistently outperforms AdamW and Muon. Furthermore, we observe significant practical stability benefits, including improved MoE router load balancing, suppressed outliers, and strictly bounded activations. Megatron Code is available at SSO Pretrain.
Key Contributions:
- Better Convergence: Outperforms AdamW and Muon by a substantial margin on Dense 1.7B, MoE 8B-A1B, and 200-layer DeepNet models, while maintaining the "healthiest" intrinsic model metrics
- Controlled Stability: Both weights and updates satisfy μP constraints, with a tunable sphere radius, suppressed outliers, and controlled activation scales that favor low-precision training
- System Efficiency: Atomic Module Sharding, Adaptive Kernel Dispatcher, Cached Singular Vectors, and more; the MuonSphere variant retains equivalent activation control with minimal overhead
2. Algorithm
SSO performs steepest descent under the spectral norm, constraining both the weights and the updates to a spectral sphere of radius R = Θ(√(d_out/d_in)).
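To make the moving parts concrete, here is a rough, hypothetical PyTorch sketch of one SSO step. The function names and the simplified retraction are ours for illustration only; the actual optimizer derives the steepest-descent direction on the sphere itself via the Lagrange-multiplier solver and cached singular vectors configured by the flags further below.

```python
import torch

def newton_schulz_msign(G, steps=8, eps=1e-7):
    """Approximate orthogonalization (matrix sign) of G via the quintic
    Newton-Schulz iteration popularized by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)              # normalize so all singular values are <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                           # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def sso_step_sketch(W, grad, buf, lr, beta=0.9, nesterov=True, radius_scaler=1.0):
    """One simplified SSO step: momentum, a spectral steepest-descent direction,
    then a hard retraction back onto the sphere of radius R."""
    d_out, d_in = W.shape
    R = radius_scaler * (d_out / d_in) ** 0.5            # R = Theta(sqrt(d_out / d_in))
    buf.mul_(beta).add_(grad)                             # momentum buffer
    g = grad.add(buf, alpha=beta) if nesterov else buf    # Nesterov-style lookahead
    W.add_(newton_schulz_msign(g), alpha=-lr)             # take the update
    sigma_max = torch.linalg.matrix_norm(W, ord=2)        # top singular value of W
    W.mul_(R / sigma_max)                                 # hard retraction: ||W||_2 = R
    return W, buf
```

The retraction keeps each module's spectral norm pinned at R, which is what yields the width-invariant Θ(1) activation control described in the abstract.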
| Description | Link |
|---|---|
| Main Experiments on Dense, MoE, DeepNet | Baselines |
| μP Learning Rate Transfer Grid Search | MuP Search |
| Spectral Radius Search for Tunable Activation Scale | Radius Search |
SSO is implemented in our fork of Megatron-LM. Use --optimizer spectral_ball_dist for distributed training.
| Argument | Default | Description |
|---|---|---|
| --spectral-ball-momentum | 0.9 | Momentum coefficient |
| --spectral-ball-use-nesterov | True | Use Nesterov-style momentum |
| --spectral-ball-msign-steps | 8 | Newton-Schulz iterations for matrix sign |
| --spectral-ball-solver | bisection | Lagrange multiplier solver method |
| --spectral-ball-solver-tolerance-f | 1e-8 | Solver tolerance |
| --spectral-ball-solver-max-iterations | 20 | Maximum solver iterations |
| --spectral-ball-power-iteration-steps | 20 | Power iteration steps for top singular vectors |
| --spectral-ball-radius-mode | spectral_mup | Mode for computing target radius R |
| --spectral-ball-radius-scaler | 1.0 | Scale factor for target radius |
| --spectral-ball-scale-mode | spectral_mup | LR scale mode (spectral_mup, align_adamw_rms, shape_scaling) |
| --spectral-ball-retract-mode | hard | Retraction mode: hard (project to sphere) or dynamic (see the sketch after this table) |
| --spectral-ball-retract-alpha | 0.05 | Step size for dynamic retraction |
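As a reading aid for the last two rows, here is a hypothetical sketch of the two retraction modes, under the assumption that hard snaps the top singular value back to R while dynamic moves only a fraction alpha of the way there each step; the repository's exact semantics may differ.

```python
import torch

def retract_sketch(W, R, mode="hard", alpha=0.05):
    """Illustrative retraction toward the spectral sphere of radius R."""
    sigma_max = torch.linalg.matrix_norm(W, ord=2)        # current spectral norm
    if mode == "hard":
        return W * (R / sigma_max)                        # land exactly on the sphere
    if mode == "dynamic":
        return W * (1.0 + alpha * (R / sigma_max - 1.0))  # move partway toward it
    raise ValueError(f"unknown retract mode: {mode}")
```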
| Argument | Default | Description |
|---|---|---|
| --spectral-mup-init | - | Enable spectral μP initialization for weights |
| --spectral-ball-no-split-qkv | (enabled) | Disable splitting QKV parameters |
| --spectral-ball-qkv-split-mode | component | QKV split: component, group, or head (see the sketch after this table) |
| --spectral-ball-no-split-fc1 | (enabled) | Disable splitting gate/up in SwiGLU |
| --spectral-ball-no-split-moe-experts | (enabled) | Disable per-expert splitting in MoE |
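Splitting matters because module-wise spectral control assigns each sub-matrix its own radius based on its own shape. Below is a hypothetical sketch of component-mode QKV splitting, assuming a plain concatenated [Q; K; V] layout; split_qkv_component and the shapes are made up for illustration, and Megatron's real fused QKV layout is interleaved per KV group, so the actual code is more involved.

```python
import torch

def split_qkv_component(qkv_weight, q_dim, kv_dim):
    """Split a fused QKV weight of shape [q_dim + 2 * kv_dim, hidden] into
    separate Q, K, V modules, each constrained on its own spectral sphere
    with R = sqrt(d_out / d_in) computed from its own shape."""
    return torch.split(qkv_weight, [q_dim, kv_dim, kv_dim], dim=0)

# Example with made-up shapes: 16 query heads, 4 KV heads, head_dim 128, hidden 2048.
qkv = torch.randn(16 * 128 + 2 * 4 * 128, 2048)
q, k, v = split_qkv_component(qkv, 16 * 128, 4 * 128)
```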
We support the logging options below for monitoring training stability. Note that MoE max-vio and per-module spectral norms are logged by default.
```bash
# log optimizer update rms before lr scaler
--log-per-module-update-rms
--log-per-module-grad-rms

--log-hidden-states embeddings input_layernorm attention::linear_qkv \
    attention::linear_q attention::linear_k attention::linear_v \
    attention::core_attention attention::o_proj pre_mlp_layernorm mlp

# Log parameter statistics
--log-params attention::linear_qkv attention::o_proj mlp::linear_fc1 \
    mlp::linear_fc2 input_layernorm pre_mlp_layernorm embedding lm_head
```

We support downstream task evaluation during training:

```bash
--benchmark-eval
--benchmark-tasks "sciq_rc_0shot,piqa_rc_0shot,winogrande_rc_0shot,arc_easy_rc_0shot,boolq_rc_0shot,logiqa_rc_0shot,lambada_ppl_0shot,hellaswag_rc_5shot,arc_challenge_rc_5shot"
```

We gratefully acknowledge the developers of Emerging-Optimizers and Megatron-LM.
This project is licensed under the Apache License 2.0.
If you have any questions, please raise an issue or contact Unakar.






