Late STE QAT + Int6 MLP3x + SmearGate + BigramHash + OrthoInit + Overtone + SWA + SGD TTT (int6+zstd-22)#297
Open
davidpuertolas wants to merge 2 commits into openai:main
Conversation
rarce added a commit to rarce/parameter-golf that referenced this pull request on Mar 21, 2026:
Three techniques from the top PRs (openai#265, openai#287, openai#297):
1. XSA (Exclusive Self Attention) on the last 3 layers (XSA_LAST_N=3): removes self-value bias via orthogonal projection (arXiv:2603.09078). GQA-aware: uses reshape+broadcast instead of repeat_interleave. Zero new parameters, ~2 ms/step overhead.
2. EMA (decay=0.997) replaces SWA (EMA_ENABLED=1, SWA_ENABLED=0): exponential moving average updated every step during warmdown. Smoother weight averaging, better generalization/compression.
3. Late QAT (QAT_LATE_FRAC=0.85): QAT activates at 85% of wallclock to avoid Muon momentum corruption. LR halved when QAT activates (per the PR openai#297 finding).
Trimmed comments to stay under the 1500-line cap (1457 lines).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
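The late-QAT technique in (3) hinges on fake quantization with a straight-through estimator: the forward pass sees weights snapped to the int6 grid, while gradients bypass the rounding and update the float master weights. A minimal sketch of a symmetric per-tensor int6 fake-quant step, assuming the signed 6-bit range [-32, 31] (the range and per-tensor scaling are assumptions; the repo's actual scheme may differ):

```python
def fake_quant_int6(w):
    """Snap a flat list of float weights to a symmetric int6 grid.

    Returns (dequantized floats, int codes). Under a straight-through
    estimator, the dequantized values are used in the forward pass while
    gradients flow unchanged to the float master weights.
    """
    scale = max(abs(x) for x in w) / 31.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    codes = [max(-32, min(31, round(x / scale))) for x in w]
    return [c * scale for c in codes], codes

w = [0.50, -0.25, 0.0, 0.125]
w_dq, codes = fake_quant_int6(w)
# Quantization error is bounded by half an int6 step (scale / 2).
```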
davidpuertolas (Author): any feedback? @0hq
This record captures Late STE QAT + a dense 9×512 stack (MLP3×, SmearGate, BigramHash, ortho / Overtone-style init, SWA) with full-model SGD test-time training (not LoRA) after sliding-window eval on the dequantized checkpoint.
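The full-model SGD test-time training mentioned above amounts to a plain momentum-SGD step applied to all weights per eval window. A toy sketch with the LR/momentum values from this submission (`3e-4` / `0.95`); the gradient here is a hypothetical placeholder for the windowed LM-loss gradient:

```python
def ttt_sgd_step(weights, grads, momentum_buf, lr=3e-4, momentum=0.95):
    """One momentum-SGD step over flat parameter lists (full model, not LoRA)."""
    for i, g in enumerate(grads):
        momentum_buf[i] = momentum * momentum_buf[i] + g
        weights[i] -= lr * momentum_buf[i]
    return weights, momentum_buf

weights = [1.0, -1.0]
momentum_buf = [0.0, 0.0]
grads = [0.2, -0.2]  # placeholder gradient from the current eval window
weights, momentum_buf = ttt_sgd_step(weights, grads, momentum_buf)
```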
Method
Training (600s wallclock, 8×H100 SXM)
Muon + AdamW, MLP 3× (hidden 1536), SmearGate, BigramHash, SWA over the second half of warmdown, late STE QAT from ~85% of wallclock with 0.5× LR when QAT activates. Key knobs:
`matrix_lr=0.025`, `muon_weight_decay=0.038`, `train_batch_tokens=786432`, `train_seq_len=2048`, `eval_stride=64`, etc. (see `README.md` in this folder).

Evaluation
- Sliding-window eval on the dequantized checkpoint (`eval_stride=64`).
- Full-model SGD TTT (LR `3e-4`, momentum `0.95`; LoRA TTT off by default).
- Roundtrip check: `final_int8_zstd_roundtrip_exact` appears in the logs when using this script.

Why zstd here
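For context on the compressed artifact: int6 codes can pack 4 values into 3 bytes before entropy coding. A hypothetical packing sketch (the repo's actual bit layout may differ); offsetting each code by 32 maps the signed range [-32, 31] onto unsigned [0, 63]:

```python
def pack_int6(codes):
    """Pack int6 codes (range [-32, 31]) into bytes, 4 codes -> 3 bytes."""
    assert len(codes) % 4 == 0
    out = bytearray()
    for i in range(0, len(codes), 4):
        a, b, c, d = (x + 32 for x in codes[i:i + 4])  # to unsigned [0, 63]
        bits = (a << 18) | (b << 12) | (c << 6) | d    # 24 bits total
        out += bits.to_bytes(3, "big")
    return bytes(out)

def unpack_int6(data):
    """Inverse of pack_int6: 3 bytes -> 4 signed int6 codes."""
    codes = []
    for i in range(0, len(data), 3):
        bits = int.from_bytes(data[i:i + 3], "big")
        codes += [((bits >> s) & 63) - 32 for s in (18, 12, 6, 0)]
    return codes

codes = [31, -16, 0, 8]
packed = pack_int6(codes)  # 3 bytes for 4 six-bit codes
assert unpack_int6(packed) == codes
```

The packed blob is what the entropy coder (zstd-22 here) then compresses; the exactness of this roundtrip is what a check like the logged one above verifies.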
Using zstd-22 instead of zlib on the same quantized blob keeps `bytes_total` under the 16,000,000-byte cap (decimal MB) for this configuration.

Submission metadata
{
  "track": "10min_16mb",
  "date": "2026-03-20",
  "name": "Late STE QAT + Int6 MLP3x + SmearGate + BigramHash + OrthoInit + Overtone + SWA + SGD TTT",
  "author": "David Puertolas Merenciano",
  "github_id": "davidpuertolas",
  "blurb": "Late STE QAT (last 15%, per #76) avoids Muon momentum corruption while closing quant gap. Full-model SGD TTT (per #152) replaces LoRA TTT which hurts with SmearGate (#178). WD=0.038 + LR=0.025 from best validated submissions (#179, #194). Artifact: int6+zstd-22, under 16MB cap.",
  "val_loss": 1.96353693,
  "val_bpb": 1.16292025,
  "bytes_total": 15948643,
  "bytes_code": 64426
}

Final metrics at `step=5464` in `train.log`. Compressed artifact (logged): 15,884,217 bytes int6+zstd + 64,426 bytes UTF-8 `train_gpt.py` = 15,948,643 bytes total.

Command
From repo root, with FineWeb `sp1024` data and tokenizer installed. Single GPU: use `--nproc_per_node=1`. Longer runs: set `MAX_WALLCLOCK_SECONDS=0` or another value.

Included files
- `old/20/03/26-zstandard/train_gpt.py`
- `old/20/03/26-zstandard/train.log`
- `old/20/03/26-zstandard/README.md`
- `old/20/03/26-zstandard/submission.json`