
Non-record: Muon-Aware QAT + LAWA + Adaptive LR Scheduling (7 toggleable improvements)#130

Open
mohosy wants to merge 1 commit into openai:main from mohosy:submission/mohosy-nonrecord

Conversation


@mohosy mohosy commented Mar 19, 2026

Non-record: Muon-Aware QAT + LAWA + Adaptive LR Scheduling

Status: Non-record (pending 8×H100 verification)


Summary

7 independently toggleable training improvements targeting the Muon optimizer's quantization sensitivity. Each technique was selected based on analysis of Muon's Newton-like orthogonalized update dynamics and the int8/int6 quantization pipeline's failure modes.

Techniques

| # | Technique | Key Insight |
|---|-----------|-------------|
| 1 | Muon-Aware QAT (STE + noise modes) | Standard STE corrupts Muon's momentum subspace; late start + LR reduction + a Gaussian-noise mode preserve convergence |
| 2 | LAWA (Latest Weight Averaging) | Uniform averaging of converged late-stage checkpoints → smoother weights → lower quantization error |
| 3 | LR Floor (10% minimum) | Prevents freezing into sharp minima during warmdown; sharp minima are quantization-sensitive |
| 4 | Cooldown Fraction Schedule | Wall-clock-aware warmdown (60% cooldown fraction instead of a fixed iteration count) |
| 5 | Sequence Length Warmup | 256→1024 tokens; ~4× more steps/sec in early training |
| 6 | Adaptive Compression | zstd/Brotli support (5–10% smaller artifacts than zlib) |
| 7 | Higher Learning Rates | matrix_lr 0.04→0.06, scalar_lr 0.04→0.06, embed_lr 0.05→0.08 |
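To make techniques 3 and 4 concrete, here is a minimal sketch of a schedule with a 60% cooldown fraction and a 10% LR floor. The function name and step-based parameterization are illustrative (the PR describes the warmdown as wall-clock-aware; steps stand in as a proxy here):

```python
def lr_at(step, total_steps, peak_lr,
          cooldown_frac=0.60, lr_floor_frac=0.10):
    """Hold peak_lr, then decay linearly over the final
    `cooldown_frac` of training, never dropping below
    `lr_floor_frac * peak_lr` (the 10% LR floor)."""
    cooldown_start = total_steps * (1.0 - cooldown_frac)
    if step < cooldown_start:
        return peak_lr
    # Linear warmdown from peak to the floor across the cooldown window.
    progress = (step - cooldown_start) / (total_steps - cooldown_start)
    floor = lr_floor_frac * peak_lr
    return max(floor, peak_lr - (peak_lr - floor) * progress)
```

With `peak_lr=0.06` (the new matrix_lr), the schedule holds 0.06 for the first 40% of training and bottoms out at 0.006 rather than 0.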

What's Novel

Muon-aware QAT design: Most QAT implementations use vanilla STE, which injects discrete rounding noise into gradients. With Muon's orthogonalization step, this directional noise gets amplified rather than averaged out (unlike Adam). Our implementation:

  • Two modes: STE (standard) and Gaussian noise (smoother for Muon)
  • Late activation at 75% of training to preserve early momentum subspace
  • Automatic 50% LR reduction when QAT activates
  • Only targets large matrices (>65K params) — small control tensors stay clean
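The four bullets above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: function names, the symmetric per-tensor quantization grid, and the uniform-noise stand-in for the Gaussian mode are all assumptions.

```python
import numpy as np

def fake_quant(w, n_bits=8, mode="ste", rng=None):
    """Simulate int quantization during the forward pass.

    mode="ste":   round-to-nearest on a symmetric per-tensor grid (the
                  backward pass would pass gradients straight through).
    mode="noise": perturb by up to half a quantization step instead of
                  rounding -- a random-direction perturbation, so Muon's
                  orthogonalization has no fixed rounding pattern to amplify.
    """
    scale = np.abs(w).max() / (2 ** (n_bits - 1) - 1)
    if scale == 0:
        return w
    if mode == "ste":
        q = np.clip(np.round(w / scale), -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1)
        return q * scale
    rng = rng or np.random.default_rng(0)
    return w + rng.uniform(-0.5, 0.5, size=w.shape) * scale

def maybe_quantize(params, step, total_steps,
                   start_frac=0.75, min_params=65_536, mode="noise"):
    """Late activation: fake-quantize only large matrices, and only
    after `start_frac` of training has elapsed (QAT off before that)."""
    if step < start_frac * total_steps:
        return params
    return {name: fake_quant(w, mode=mode) if w.size > min_params else w
            for name, w in params.items()}
```

The 50% LR reduction at activation would live in the training loop (e.g. halving the schedule's output once `step >= start_frac * total_steps`), which this sketch omits.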

LAWA + QAT synergy: QAT teaches quant-robustness during training; LAWA smooths remaining outlier weights post-convergence. Complementary mechanisms targeting the same goal from different angles.
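The LAWA half of that synergy is simple enough to show inline. A minimal sketch, assuming checkpoints are stored as name→array dicts (the real implementation would operate on model state dicts):

```python
import numpy as np

def lawa_average(checkpoints):
    """LAWA: uniform average of the latest k checkpoints.

    `checkpoints` is a list of {param_name: array} snapshots saved during
    the final stretch of training. Averaging damps per-step oscillations
    and outlier weights, which lowers post-training quantization error."""
    k = len(checkpoints)
    return {name: sum(ckpt[name] for ckpt in checkpoints) / k
            for name in checkpoints[0]}
```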

Research Foundation

Backed by targeted analysis of:

Next Steps (in-progress)

  • Int6 quantization + MLP 3× expansion
  • Sliding window evaluation
  • NorMuon optimizer
  • Depth recurrence (middle-looped: 1 prelude + 3 shared × 3 loops + 1 coda)
  • 8×H100 verification runs

Usage

# Full stack
QAT_ENABLED=1 QAT_MODE=noise LAWA_ENABLED=1 SEQ_LEN_WARMUP=1 COMPRESSION=zstd \
  torchrun --standalone --nproc_per_node=8 train_gpt.py

All features default to safe values and can be independently toggled via environment variables. See README in submission folder for full documentation.

7 independently toggleable training improvements targeting the Muon
optimizer's quantization sensitivity. Pending 8xH100 verification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
