
Non-record: Muon-Aware QAT + LAWA + Adaptive LR Scheduling (7 toggleable improvements)#130

Open
mohosy wants to merge 1 commit into openai:main from mohosy:submission/mohosy-nonrecord

Conversation


@mohosy mohosy commented Mar 19, 2026

Non-record: Muon-Aware QAT + LAWA + Adaptive LR Scheduling

Status: Non-record (pending 8×H100 verification)


Summary

7 independently toggleable training improvements targeting the Muon optimizer's quantization sensitivity. Each technique was selected based on analysis of Muon's Newton-like orthogonalized update dynamics and the int8/int6 quantization pipeline's failure modes.

Techniques

| # | Technique | Key Insight |
|---|-----------|-------------|
| 1 | Muon-Aware QAT (STE + noise modes) | Standard STE corrupts Muon's momentum subspace; late start + LR reduction + a Gaussian-noise mode preserve convergence |
| 2 | LAWA (Latest Weight Averaging) | Uniform averaging of converged late-stage checkpoints → smoother weights → lower quantization error |
| 3 | LR Floor (10% minimum) | Prevents freezing into sharp minima during warmdown; sharp minima are quantization-sensitive |
| 4 | Cooldown Fraction Schedule | Wall-clock-aware warmdown (60% cooldown fraction instead of a fixed iteration count) |
| 5 | Sequence Length Warmup | 256→1024 tokens; ~4× more steps/sec in early training |
| 6 | Adaptive Compression | zstd/Brotli support (5–10% smaller artifacts than zlib) |
| 7 | Higher Learning Rates | matrix_lr 0.04→0.06, scalar_lr 0.04→0.06, embed_lr 0.05→0.08 |
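To make techniques 3 and 4 concrete, here is a minimal sketch of a schedule with a 60% cooldown fraction and a 10% LR floor. The function name and step-based parameterization are illustrative (the PR describes the warmdown as wall-clock-aware; steps stand in as a proxy here):

```python
def lr_at(step, total_steps, peak_lr,
          cooldown_frac=0.60, lr_floor_frac=0.10):
    """Hold peak_lr, then decay linearly over the final
    `cooldown_frac` of training, never dropping below
    `lr_floor_frac * peak_lr` (the 10% LR floor)."""
    cooldown_start = total_steps * (1.0 - cooldown_frac)
    if step < cooldown_start:
        return peak_lr
    # Linear warmdown from peak to the floor across the cooldown window.
    progress = (step - cooldown_start) / (total_steps - cooldown_start)
    floor = lr_floor_frac * peak_lr
    return max(floor, peak_lr - (peak_lr - floor) * progress)
```

With `peak_lr=0.06` (the new matrix_lr), the schedule holds 0.06 for the first 40% of training and bottoms out at 0.006 rather than 0.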

What's Novel

Muon-aware QAT design: Most QAT implementations use vanilla STE, which injects discrete rounding noise into gradients. With Muon's orthogonalization step, this directional noise gets amplified rather than averaged out (unlike Adam). Our implementation:

  • Two modes: STE (standard) and Gaussian noise (smoother for Muon)
  • Late activation at 75% of training to preserve early momentum subspace
  • Automatic 50% LR reduction when QAT activates
  • Only targets large matrices (>65K params) — small control tensors stay clean
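The four bullets above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: function names, the symmetric per-tensor quantization grid, and the uniform-noise stand-in for the Gaussian mode are all assumptions.

```python
import numpy as np

def fake_quant(w, n_bits=8, mode="ste", rng=None):
    """Simulate int quantization during the forward pass.

    mode="ste":   round-to-nearest on a symmetric per-tensor grid (the
                  backward pass would pass gradients straight through).
    mode="noise": perturb by up to half a quantization step instead of
                  rounding -- a random-direction perturbation, so Muon's
                  orthogonalization has no fixed rounding pattern to amplify.
    """
    scale = np.abs(w).max() / (2 ** (n_bits - 1) - 1)
    if scale == 0:
        return w
    if mode == "ste":
        q = np.clip(np.round(w / scale), -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1)
        return q * scale
    rng = rng or np.random.default_rng(0)
    return w + rng.uniform(-0.5, 0.5, size=w.shape) * scale

def maybe_quantize(params, step, total_steps,
                   start_frac=0.75, min_params=65_536, mode="noise"):
    """Late activation: fake-quantize only large matrices, and only
    after `start_frac` of training has elapsed (QAT off before that)."""
    if step < start_frac * total_steps:
        return params
    return {name: fake_quant(w, mode=mode) if w.size > min_params else w
            for name, w in params.items()}
```

The 50% LR reduction at activation would live in the training loop (e.g. halving the schedule's output once `step >= start_frac * total_steps`), which this sketch omits.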

LAWA + QAT synergy: QAT teaches quant-robustness during training; LAWA smooths remaining outlier weights post-convergence. Complementary mechanisms targeting the same goal from different angles.
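The LAWA half of that synergy is simple enough to show inline. A minimal sketch, assuming checkpoints are stored as name→array dicts (the real implementation would operate on model state dicts):

```python
import numpy as np

def lawa_average(checkpoints):
    """LAWA: uniform average of the latest k checkpoints.

    `checkpoints` is a list of {param_name: array} snapshots saved during
    the final stretch of training. Averaging damps per-step oscillations
    and outlier weights, which lowers post-training quantization error."""
    k = len(checkpoints)
    return {name: sum(ckpt[name] for ckpt in checkpoints) / k
            for name in checkpoints[0]}
```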

Research Foundation

Backed by targeted analysis of:

Next Steps (in-progress)

  • Int6 quantization + MLP 3× expansion
  • Sliding window evaluation
  • NorMuon optimizer
  • Depth recurrence (middle-looped: 1 prelude + 3 shared × 3 loops + 1 coda)
  • 8×H100 verification runs

Usage

# Full stack
QAT_ENABLED=1 QAT_MODE=noise LAWA_ENABLED=1 SEQ_LEN_WARMUP=1 COMPRESSION=zstd \
  torchrun --standalone --nproc_per_node=8 train_gpt.py

All features default to safe values and can be independently toggled via environment variables. See README in submission folder for full documentation.

7 independently toggleable training improvements targeting the Muon
optimizer's quantization sensitivity. Pending 8xH100 verification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
