Non-record: Muon-Aware QAT + LAWA + Adaptive LR Scheduling (7 toggleable improvements)#130
Open
mohosy wants to merge 1 commit into openai:main from
Conversation
7 independently toggleable training improvements targeting the Muon optimizer's quantization sensitivity. Pending 8xH100 verification. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Non-record: Muon-Aware QAT + LAWA + Adaptive LR Scheduling
Status: Non-record (pending 8×H100 verification)
Summary
7 independently toggleable training improvements targeting the Muon optimizer's quantization sensitivity. Each technique was selected based on analysis of Muon's Newton-like orthogonalized update dynamics and the int8/int6 quantization pipeline's failure modes.
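Since each improvement is toggled independently via environment variables, a helper like the following could gate the features. This is a minimal sketch with illustrative names; `env_flag` and the defaults are assumptions, not the PR's actual code.

```python
import os

def env_flag(name: str, default: str = "0") -> bool:
    """Hypothetical toggle helper: each improvement reads one environment
    variable so it can be enabled independently of the others."""
    return os.environ.get(name, default).lower() not in ("0", "", "false")

# Illustrative toggles matching the usage example; defaults are "off".
QAT_ENABLED = env_flag("QAT_ENABLED")
LAWA_ENABLED = env_flag("LAWA_ENABLED")
SEQ_LEN_WARMUP = env_flag("SEQ_LEN_WARMUP")
QAT_MODE = os.environ.get("QAT_MODE", "noise")  # e.g. "noise"
```

Keeping every flag off by default means an unmodified `torchrun` invocation reproduces the baseline run.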
Techniques
What's Novel
Muon-aware QAT design: Most QAT implementations use a vanilla straight-through estimator (STE), which injects discrete rounding noise into gradients. With Muon's orthogonalization step, this directional noise is amplified rather than averaged out (unlike with Adam's elementwise updates). Our implementation:
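One way to avoid STE rounding noise in the `noise` QAT mode is to simulate quantization with zero-mean uniform noise during training. The sketch below illustrates that idea under stated assumptions; `fake_quant_noise` and its signature are hypothetical, not the PR's implementation.

```python
import torch

def fake_quant_noise(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Additive-noise QAT sketch. During training, add zero-mean uniform
    noise matching the quantization step instead of hard STE rounding, so
    Muon's orthogonalized update sees an unbiased perturbation rather than
    correlated rounding error. Names and API are illustrative."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.detach().abs().amax().clamp_min(1e-8) / qmax
    if torch.is_grad_enabled() and w.requires_grad:
        # Training path: noise in (-scale/2, scale/2); gradients flow
        # through w untouched, with no discrete rounding in the graph.
        noise = (torch.rand_like(w) - 0.5) * scale
        return w + noise
    # Eval path: real round-to-nearest int8/int6 quantize-dequantize.
    return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
```

At evaluation time the same function applies true round-to-nearest quantization, so validation loss reflects the deployed int8/int6 weights.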
LAWA + QAT synergy: QAT teaches quant-robustness during training; LAWA smooths remaining outlier weights post-convergence. Complementary mechanisms targeting the same goal from different angles.
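The LAWA side of this pairing reduces to averaging the last few checkpoints. A minimal sketch, assuming plain state dicts and uniform weighting (the function name and interface are illustrative, not the PR's API):

```python
import torch

def lawa_average(state_dicts):
    """Latest Weight Averaging (LAWA) sketch: uniform average of the last
    k checkpoints' parameters, applied post-convergence to smooth outlier
    weights before int8/int6 quantization. Illustrative names only."""
    n = len(state_dicts)
    avg = {k: v.detach().clone().float() for k, v in state_dicts[0].items()}
    for sd in state_dicts[1:]:
        for k, v in sd.items():
            avg[k] += v.float()
    return {k: v / n for k, v in avg.items()}
```

Averaging in float32 before re-quantizing avoids accumulating rounding error across the k checkpoints.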
Research Foundation
Backed by targeted analysis of:
Next Steps (in-progress)
Usage
```shell
# Full stack
QAT_ENABLED=1 QAT_MODE=noise LAWA_ENABLED=1 SEQ_LEN_WARMUP=1 COMPRESSION=zstd \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

All features default to safe values and can be independently toggled via environment variables. See the README in the submission folder for full documentation.