Late STE QAT + Int6 MLP3x + SmearGate + BigramHash + OrthoInit + Overtone + SWA + SGD TTT (int6+zstd-22)#297
Open
davidpuertolas wants to merge 2 commits into openai:main
Conversation
rarce added a commit to rarce/parameter-golf that referenced this pull request on Mar 21, 2026:
Three techniques from the top PRs (openai#265, openai#287, openai#297):
1. XSA (Exclusive Self Attention) on the last 3 layers (XSA_LAST_N=3): removes self-value bias via orthogonal projection (arXiv:2603.09078). GQA-aware: uses reshape+broadcast instead of repeat_interleave. Zero new parameters, ~2 ms/step overhead.
2. EMA (decay=0.997) replaces SWA (EMA_ENABLED=1, SWA_ENABLED=0): exponential moving average updated every step during warmdown. Smoother weight averaging, better generalization/compression.
3. Late QAT (QAT_LATE_FRAC=0.85): QAT activates at 85% of wallclock to avoid Muon momentum corruption. LR halved when QAT activates (per the PR openai#297 finding).
Trimmed comments to stay under the 1500-line cap (1457 lines).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
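The late-QAT technique in (3) hinges on fake quantization with a straight-through estimator: the forward pass sees weights snapped to the int6 grid, while gradients bypass the rounding and update the float master weights. A minimal sketch of a symmetric per-tensor int6 fake-quant step, assuming the signed 6-bit range [-32, 31] (the range and per-tensor scaling are assumptions; the repo's actual scheme may differ):

```python
def fake_quant_int6(w):
    """Snap a flat list of float weights to a symmetric int6 grid.

    Returns (dequantized floats, int codes). Under a straight-through
    estimator, the dequantized values are used in the forward pass while
    gradients flow unchanged to the float master weights.
    """
    scale = max(abs(x) for x in w) / 31.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    codes = [max(-32, min(31, round(x / scale))) for x in w]
    return [c * scale for c in codes], codes

w = [0.50, -0.25, 0.0, 0.125]
w_dq, codes = fake_quant_int6(w)
# Quantization error is bounded by half an int6 step (scale / 2).
```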
davidpuertolas (Author): any feedback? @0hq
This record captures Late STE QAT + a dense 9×512 stack (MLP3×, SmearGate, BigramHash, ortho / Overtone-style init, SWA) with full-model SGD test-time training (not LoRA) after sliding-window eval on the dequantized checkpoint.
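The full-model SGD test-time training mentioned above amounts to a plain momentum-SGD step applied to all weights per eval window. A toy sketch with the LR/momentum values from this submission (`3e-4` / `0.95`); the gradient here is a hypothetical placeholder for the windowed LM-loss gradient:

```python
def ttt_sgd_step(weights, grads, momentum_buf, lr=3e-4, momentum=0.95):
    """One momentum-SGD step over flat parameter lists (full model, not LoRA)."""
    for i, g in enumerate(grads):
        momentum_buf[i] = momentum * momentum_buf[i] + g
        weights[i] -= lr * momentum_buf[i]
    return weights, momentum_buf

weights = [1.0, -1.0]
momentum_buf = [0.0, 0.0]
grads = [0.2, -0.2]  # placeholder gradient from the current eval window
weights, momentum_buf = ttt_sgd_step(weights, grads, momentum_buf)
```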
Method
Training (600s wallclock, 8×H100 SXM)
Muon + AdamW, MLP 3× (hidden 1536), SmearGate, BigramHash, SWA over the second half of warmdown, late STE QAT from ~85% of wallclock with 0.5× LR when QAT activates. Key knobs:
`matrix_lr=0.025`, `muon_weight_decay=0.038`, `train_batch_tokens=786432`, `train_seq_len=2048`, `eval_stride=64`, etc. (see `README.md` in this folder).

Evaluation
- Sliding-window eval on the dequantized checkpoint (`eval_stride=64`).
- Full-model SGD TTT (LR `3e-4`, momentum `0.95`; LoRA TTT off by default).
- Roundtrip check: `final_int8_zstd_roundtrip_exact` appears in the logs when using this script.

Why zstd here
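For context on the compressed artifact: int6 codes can pack 4 values into 3 bytes before entropy coding. A hypothetical packing sketch (the repo's actual bit layout may differ); offsetting each code by 32 maps the signed range [-32, 31] onto unsigned [0, 63]:

```python
def pack_int6(codes):
    """Pack int6 codes (range [-32, 31]) into bytes, 4 codes -> 3 bytes."""
    assert len(codes) % 4 == 0
    out = bytearray()
    for i in range(0, len(codes), 4):
        a, b, c, d = (x + 32 for x in codes[i:i + 4])  # to unsigned [0, 63]
        bits = (a << 18) | (b << 12) | (c << 6) | d    # 24 bits total
        out += bits.to_bytes(3, "big")
    return bytes(out)

def unpack_int6(data):
    """Inverse of pack_int6: 3 bytes -> 4 signed int6 codes."""
    codes = []
    for i in range(0, len(data), 3):
        bits = int.from_bytes(data[i:i + 3], "big")
        codes += [((bits >> s) & 63) - 32 for s in (18, 12, 6, 0)]
    return codes

codes = [31, -16, 0, 8]
packed = pack_int6(codes)  # 3 bytes for 4 six-bit codes
assert unpack_int6(packed) == codes
```

The packed blob is what the entropy coder (zstd-22 here) then compresses; the exactness of this roundtrip is what a check like the logged one above verifies.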
Using zstd-22 instead of zlib on the same quantized blob keeps `bytes_total` under the 16,000,000-byte cap (decimal MB) for this configuration.

Submission metadata
{
  "track": "10min_16mb",
  "date": "2026-03-20",
  "name": "Late STE QAT + Int6 MLP3x + SmearGate + BigramHash + OrthoInit + Overtone + SWA + SGD TTT",
  "author": "David Puertolas Merenciano",
  "github_id": "davidpuertolas",
  "blurb": "Late STE QAT (last 15%, per #76) avoids Muon momentum corruption while closing quant gap. Full-model SGD TTT (per #152) replaces LoRA TTT which hurts with SmearGate (#178). WD=0.038 + LR=0.025 from best validated submissions (#179, #194). Artifact: int6+zstd-22, under 16MB cap.",
  "val_loss": 1.96353693,
  "val_bpb": 1.16292025,
  "bytes_total": 15948643,
  "bytes_code": 64426
}

Final metrics at `step=5464` in `train.log`. Compressed artifact (logged): 15,884,217 bytes int6+zstd + 64,426 bytes UTF-8 `train_gpt.py` = 15,948,643 bytes total.

Command
From repo root, with FineWeb `sp1024` data and tokenizer installed. Single GPU: use `--nproc_per_node=1`. Longer runs: set `MAX_WALLCLOCK_SECONDS=0` or another value.

Included files
- `old/20/03/26-zstandard/train_gpt.py`
- `old/20/03/26-zstandard/train.log`
- `old/20/03/26-zstandard/README.md`
- `old/20/03/26-zstandard/submission.json`