
Late STE QAT + Int6 MLP3x + SmearGate + BigramHash + OrthoInit + Overtone + SWA + SGD TTT (int6+zstd-22) #297

Open
davidpuertolas wants to merge 2 commits into openai:main from
davidpuertolas:submission/2026-03-20_STE_QAT_MLP3x_SmearBigram_LoRATTT

Conversation

@davidpuertolas

This record captures Late STE QAT + a dense 9×512 stack (MLP3×, SmearGate, BigramHash, ortho / Overtone-style init, SWA) with full-model SGD test-time training (not LoRA) after sliding-window eval on the dequantized checkpoint.

Method

Training (600s wallclock, 8×H100 SXM)

Muon + AdamW optimizers; MLP with hidden size 1536; SmearGate; BigramHash; SWA over the second half of warmdown; late STE QAT starting at ~85% of wallclock, with the LR scaled by 0.5× once QAT activates. Key knobs: `matrix_lr=0.025`, `muon_weight_decay=0.038`, `train_batch_tokens=786432`, `train_seq_len=2048`, `eval_stride=64`, etc. (see README.md in this folder).
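The late-QAT schedule described above can be sketched as follows. This is a minimal illustration, not the code in train_gpt.py: `fake_quant_int6_ste` and `qat_active` are hypothetical names, and only the 600 s cap, the ~85% activation point, and the 0.5× LR factor come from this write-up.

```python
import torch

def fake_quant_int6_ste(w: torch.Tensor) -> torch.Tensor:
    """Per-row symmetric int6 fake-quantization with a straight-through estimator."""
    # Symmetric int6 code range: [-31, 31], one scale per row.
    scale = w.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / 31.0
    q = (w / scale).round().clamp(-31, 31) * scale  # quantize-dequantize
    # STE: forward pass sees q, backward pass treats round() as identity.
    return w + (q - w).detach()

def qat_active(elapsed_s: float, wallclock_cap_s: float = 600.0,
               late_frac: float = 0.85) -> bool:
    """Late QAT: only fake-quantize during the last (1 - late_frac) of the run."""
    return elapsed_s >= late_frac * wallclock_cap_s

# Usage inside a (hypothetical) training step:
w = torch.randn(4, 16, requires_grad=True)
lr = 0.025
if qat_active(elapsed_s=540.0):   # past the 85% mark of a 600 s run
    lr *= 0.5                     # halve the LR when QAT activates
    out = fake_quant_int6_ste(w).sum()
else:
    out = w.sum()
out.backward()                    # gradients flow as if no rounding happened
```

The detach trick is what lets Muon/AdamW keep receiving clean gradients while the forward pass already sees quantized weights; activating it only late in the run is what the description credits with avoiding Muon momentum corruption.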

Evaluation

  1. Serialize int6 (per-row) + compress with zstd level 22 → submission artifact.
  2. Load, sliding-window validation (eval_stride=64).
  3. SGD TTT (LR 3e-4, momentum 0.95; LoRA TTT off by default).
  4. Report roundtrip metrics (final_int8_zstd_roundtrip_exact in logs when using this script).
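Step 3, full-model SGD test-time training, amounts to taking plain SGD steps on the eval stream with every parameter trainable, no adapters. A minimal sketch with a toy model standing in for the dequantized checkpoint; only the LR 3e-4 / momentum 0.95 settings come from this submission:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the dequantized checkpoint; any nn.Module works the same way.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))

# LR and momentum as reported here; every parameter is updated (no LoRA).
opt = torch.optim.SGD(model.parameters(), lr=3e-4, momentum=0.95)

def sgd_ttt_step(x: torch.Tensor, y: torch.Tensor) -> float:
    """One full-model test-time-training step on (part of) the eval stream."""
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

x, y = torch.randn(256, 16), torch.randint(0, 8, (256,))
losses = [sgd_ttt_step(x, y) for _ in range(50)]
```

Because all weights move, this interacts with every module in the stack, which is consistent with the blurb's observation that adapter-only (LoRA) TTT behaves differently in the presence of SmearGate.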

Why zstd here

Using zstd-22 instead of zlib on the same quantized blob keeps bytes_total under the 16,000,000-byte cap (decimal MB) for this configuration.
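The artifact format from step 1 (per-row int6, then zstd) is not reproduced here in full; a minimal sketch of symmetric per-row int6 quantization plus 6-bit packing, with illustrative function names and assuming a symmetric scheme, could look like:

```python
import numpy as np

def quant_int6_per_row(w: np.ndarray):
    """Symmetric per-row int6: codes in [-31, 31] plus one float32 scale per row."""
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-8) / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale.astype(np.float32)

def pack_int6(q: np.ndarray) -> bytes:
    """Pack int6 codes at 6 bits each (4 codes -> 3 bytes)."""
    u = (q.astype(np.int16) + 32).astype(np.uint8).reshape(-1, 1)  # shift to [1, 63]
    bits = np.unpackbits(u, axis=1)[:, 2:]                         # keep the low 6 bits
    return np.packbits(bits.ravel()).tobytes()

def unpack_int6(b: bytes, n: int) -> np.ndarray:
    bits = np.unpackbits(np.frombuffer(b, np.uint8))[: n * 6].reshape(n, 6)
    u = np.packbits(np.pad(bits, ((0, 0), (2, 0))), axis=1)[:, 0]
    return u.astype(np.int16) - 32

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 64)).astype(np.float32)
q, s = quant_int6_per_row(w)
blob = pack_int6(q)               # 6 bits per weight instead of 8
w_hat = q.astype(np.float32) * s  # the "dequantized checkpoint" of step 2
```

The packed blob (plus the per-row scales) is what zstd level 22 would then compress, e.g. with python-zstandard's `zstandard.ZstdCompressor(level=22).compress(blob)`. Quantization error is bounded by half a scale step per row, and re-quantizing `w_hat` reproduces the same codes, which is the kind of property a roundtrip-exactness check can verify.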

Submission metadata

```json
{
  "track": "10min_16mb",
  "date": "2026-03-20",
  "name": "Late STE QAT + Int6 MLP3x + SmearGate + BigramHash + OrthoInit + Overtone + SWA + SGD TTT",
  "author": "David Puertolas Merenciano",
  "github_id": "davidpuertolas",
  "blurb": "Late STE QAT (last 15%, per #76) avoids Muon momentum corruption while closing quant gap. Full-model SGD TTT (per #152) replaces LoRA TTT which hurts with SmearGate (#178). WD=0.038 + LR=0.025 from best validated submissions (#179, #194). Artifact: int6+zstd-22, under 16MB cap.",
  "val_loss": 1.96353693,
  "val_bpb": 1.16292025,
  "bytes_total": 15948643,
  "bytes_code": 64426
}
```
| Field | Value |
| --- | --- |
| val_bpb | 1.16292025 |
| val_loss | 1.96353693 |
| bytes_total | 15,948,643 (below 16,000,000) |
| bytes_code | 64,426 |
| Seed (logged) | 1337 |
| Wallclock cap | 600s (step=5464 in train.log) |

Compressed artifact (logged): 15,884,217 bytes int6+zstd + 64,426 bytes UTF-8 train_gpt.py = 15,948,643 total.

Command

From repo root, with FineWeb sp1024 data and tokenizer installed:

```shell
pip install zstandard

export HF_TOKEN="..."   # if needed for dataset download
python3 data/cached_challenge_fineweb.py --variant sp1024
RUN_ID=late_qat_sgd_ttt_zstd \
SEED=1337 \
DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
EVAL_STRIDE=64 \
SGD_TTT_ENABLED=1 \
TTT_LORA_ENABLED=0 \
torchrun --standalone --nproc_per_node=8 \
  old/20/03/26-zstandard/train_gpt.py
```

For a single GPU, use `--nproc_per_node=1`. For longer runs, set `MAX_WALLCLOCK_SECONDS` to 0 or another value.

Included files

| File | Role |
| --- | --- |
| old/20/03/26-zstandard/train_gpt.py | Training + int6+zstd artifact |
| old/20/03/26-zstandard/train.log | Example log (seed 1337) |
| old/20/03/26-zstandard/README.md | Full write-up |
| old/20/03/26-zstandard/submission.json | Challenge JSON |

rarce added a commit to rarce/parameter-golf that referenced this pull request Mar 21, 2026
Three techniques from the top PRs (openai#265, openai#287, openai#297):

1. XSA (Exclusive Self Attention) on last 3 layers (XSA_LAST_N=3):
   Removes self-value bias via orthogonal projection (arXiv:2603.09078).
   GQA-aware: uses reshape+broadcast instead of repeat_interleave.
   Zero new parameters, ~2ms/step overhead.

2. EMA (decay=0.997) replaces SWA (EMA_ENABLED=1, SWA_ENABLED=0):
   Exponential moving average updated every step during warmdown.
   Smoother weight averaging, better generalization/compression.

3. Late QAT (QAT_LATE_FRAC=0.85):
   QAT activates at 85% of wallclock to avoid Muon momentum corruption.
   LR halved when QAT activates (per PR openai#297 finding).

Trimmed comments to stay under 1500-line cap (1457 lines).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@davidpuertolas
Author

any feedback? @0hq
