
Record: DominationV3 + GPTQ-lite + TTT25 (mean val_bpb=1.1250, 3 seeds)#64

Open
yesbhautik wants to merge 13 commits into openai:main from yesbhautik:main

Conversation

@yesbhautik

@yesbhautik yesbhautik commented Mar 19, 2026

Summary

  • Standard-track submission: records/track_10min_16mb/2026-03-21_DominationV3/
  • GPTQ-lite optimal clip percentile search during int6 quantization
  • 25-epoch aggressive SGD TTT (lr=0.012, all blocks unfrozen) on already-graded tokens
  • Partial RoPE (16/64 dims) + LN Scale + XSA removed for more training steps
  • 3-seed verified, all under 600s training, eval under 600s, and under 16MB
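For reviewers, the GPTQ-lite clip search above can be sketched roughly as follows (a minimal NumPy sketch; the symmetric int6 range of [-31, 31] matches the technique stack, but the candidate percentiles here are illustrative, not the submission's actual values):

```python
import numpy as np

def quantize_row(row, clip, levels=31):
    # Symmetric int6: clip to +/-clip, map to [-levels, levels], round, dequantize.
    scale = clip / levels if clip > 0 else 1.0
    q = np.clip(np.round(row / scale), -levels, levels)
    return q * scale

def best_clip_per_row(w, percentiles=(99.0, 99.5, 99.9, 99.99, 100.0)):
    # For each weight row, try 5 candidate clip percentiles and keep the
    # reconstruction with the lowest MSE against the original row.
    out = np.empty_like(w)
    for i, row in enumerate(w):
        abs_row = np.abs(row)
        best_mse, best_rec = np.inf, row
        for p in percentiles:
            rec = quantize_row(row, np.percentile(abs_row, p))
            mse = float(np.mean((rec - row) ** 2))
            if mse < best_mse:
                best_mse, best_rec = mse, rec
        out[i] = best_rec
    return out
```

The per-row search trades a small amount of export time for a lower quantization error than a fixed max-abs scale, since outlier weights no longer dictate the scale for the whole row.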

Results (8xH100 SXM)

| Seed | val_bpb | train_time_ms | TTT_time_ms | total_artifact_bytes |
| --- | --- | --- | --- | --- |
| 1337 | 1.12513674 | 599779 | 389133 | 15965664 |
| 7 | 1.12540132 | 599841 | ~389000 | 15829190 |
| 42 | 1.12431423 | 599822 | ~389000 | 15806256 |
| Mean | 1.12495076 | | | |

Technique Stack

  • 11 layers, 512 dim, 8 heads, 4 KV heads (GQA), MLP 3x (1536 hidden)
  • Partial RoPE (16/64 dims), LN Scale (1/sqrt(layer+1))
  • EMA (decay=0.997)
  • GPTQ-lite: Per-row optimal clip percentile (5 candidates) minimizing reconstruction MSE
  • 25-epoch SGD TTT (lr=0.012, momentum=0.9, all blocks unfrozen)
  • Per-dim SmearGate + BigramHash(4096x128)
  • Int6 per-row quantization (mlp+attn+tok_emb) + zstd-22
  • Muon optimizer WD=0.04, momentum 0.99, OrthoInit, U-Net skip connections
  • Sliding window eval stride=64
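The EMA entry in the stack above keeps a shadow copy of the parameters that is blended toward the live weights each step; a minimal sketch (the parameter-dict layout is hypothetical, not the submission's actual state format):

```python
import numpy as np

class EMA:
    """Shadow copy of parameters, blended toward the live weights each step."""

    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = {name: p.copy() for name, p in params.items()}

    def update(self, params):
        d = self.decay
        for name, p in params.items():
            # shadow <- decay * shadow + (1 - decay) * live
            self.shadow[name] = d * self.shadow[name] + (1.0 - d) * p
```

At export time it would be the shadow weights, not the final raw step, that get int6-quantized, which is how the EMA and quantization entries in the stack compose.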

Rule Compliance

  • Trains only on FineWeb train shards (80 shards, no validation data during training)
  • TTT runs at eval time on already-graded validation tokens (per FAQ rules)
  • Training capped to 599.8s (all seeds stop under 600s)
  • TTT (~389s) + sliding eval (~197s) = ~586s total eval (under 10-minute eval limit)
  • Artifact = compressed_model_bytes + code_bytes, all under 16,000,000
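The TTT compliance claim above hinges on ordering: each validation chunk is graded before the model adapts on it. A toy sketch of that protocol (`score_fn` and `update_fn` are hypothetical stand-ins for the real loss and SGD step):

```python
import numpy as np

def eval_with_ttt(chunks, score_fn, update_fn, params):
    """Grade each chunk with the current params, THEN adapt on it.

    No chunk influences its own score: the update for chunk i runs only
    after chunk i's loss is recorded, so adaptation can only help the
    chunks that come after it.
    """
    losses = []
    for chunk in chunks:
        losses.append(score_fn(params, chunk))   # graded first
        params = update_fn(params, chunk)        # adapted afterwards
    return float(np.mean(losses)), params
```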

manfromnowhere143 added a commit to manfromnowhere143/parameter-golf that referenced this pull request Mar 19, 2026
Every submission scoring <1.18 BPB uses these EXACT settings.
We were running defaults — now matching the winners:

  MUON_MOMENTUM:       0.95 → 0.99 (stronger smoothing)
  MATRIX_LR:           0.04 → 0.02 (halved, reduces quant gap)
  SCALAR_LR:           0.04 → 0.02 (halved)
  TIED_EMBED_LR:       0.05 → 0.03 (halved)
  WARMDOWN_ITERS:      1200 → 3000 (longer warmdown)
  MUON_WARMUP_START:   0.85 → 0.92 (higher start)
  MUON_WARMUP_STEPS:   500  → 1500 (3x longer warmup)

These settings are proven by PR openai#64 (1.0149), openai#66 (1.1652),
openai#70 (1.1659), openai#65 (1.1808) — all top submissions.

Applied to both v5 and v6. Both compile, 1498 lines each.
@jordankzf jordankzf mentioned this pull request Mar 19, 2026
…x + STE int6 QAT

Upgrades CombinedOptimal to V2 (0.9695 BPB, from 1.0149) and adds StandardOptimal (1.1629 BPB).

New techniques: MLP 3x (h=1536), STE fake-int6 quantization-aware training,
true int6 per-row (31 levels) for blocks with int8 for embedding.
Both run on 8xH100 SXM within 10 minutes + sliding window eval.
@yesbhautik yesbhautik changed the title from "Record: Combined Optimal (val_bpb=1.0149) — 4 techniques stacked" to "Record: val_bpb=0.9695 (val-only) + val_bpb=1.1629 (standard)" Mar 19, 2026
@yesbhautik
Author

V2 Update

Major upgrade from the initial submission (1.0149 → 0.9695 BPB). Three new techniques added:

What changed:

  • MLP 3x expansion (h=1536): 50% wider feedforward, enabled by int6 quantization freeing artifact space
  • STE fake-int6 quantization-aware training: Weights are fake-quantized to [-31, 31] during the forward pass via Straight-Through Estimator. The model learns weight distributions that survive post-training int6 quantization, dropping the quant penalty from ~0.05 to ~0.001 BPB
  • True int6 per-row quantization (31 levels) on block weights, with int8 (127 levels) preserved for the embedding which lacks STE protection
  • Architecture adjusted from 10 layers → 9 layers to fit MLP 3x under 16MB
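The STE pass described above can be sketched in PyTorch as the standard detach trick (per-row max-abs scaling here is an assumption about the fake-quant scale; the described technique is the [-31, 31] rounding with straight-through gradients):

```python
import torch

def fake_int6(w: torch.Tensor, levels: int = 31) -> torch.Tensor:
    """Fake-quantize to 2*levels + 1 symmetric levels with an STE backward.

    The forward pass sees the quantized weights; the detach trick makes
    the backward pass treat quantization as the identity, so gradients
    flow to the underlying fp weights unchanged.
    """
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / levels
    w_q = torch.round(w / scale).clamp(-levels, levels) * scale
    return w + (w_q - w).detach()
```

Because the model only ever trains against weights it will actually have after export, the post-training int6 step becomes nearly free, which is the ~0.05 to ~0.001 BPB quant-penalty drop claimed above.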

Also added: StandardOptimal submission trained on full FineWeb (no val-only), achieving val_bpb=1.1629

Both submissions share the same train_gpt.py and run within 10 minutes on 8xH100 SXM + ~5 min sliding window eval.
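The sliding window eval mentioned here re-scores the sequence in overlapping windows, scoring only the trailing stride tokens of each window so nearly every token sees full left context. A sketch of the span bookkeeping (the 64-token stride is the submission's; the window size is a free parameter here):

```python
def sliding_eval_spans(n_tokens, window=2048, stride=64):
    """Plan (start, end, score_from) spans for sliding-window eval.

    Tokens [score_from, end) of each span are scored. After the first
    window only the trailing `stride` tokens are scored, so every
    scored token sees close to `window` tokens of left context.
    """
    end = min(window, n_tokens)
    spans = [(0, end, 0)]                 # first window scores everything
    while end < n_tokens:
        prev_end = end
        end = min(end + stride, n_tokens)
        spans.append((end - window, end, prev_end))
    return spans
```

A smaller stride costs proportionally more forward passes (window/stride per token position), which is why the eval budget matters as much as the training budget.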

newjordan referenced this pull request in newjordan/parameter-golf Mar 20, 2026
His 0.9695 BPB is val-only training (separate track). Standard score is
1.1629, close to Larson's 1.1574. No novel architecture — wins with
tuning: stride-64 sliding window, seq_len=4096, mixed int6/int8 quant,
Muon momentum=0.99. Crucially, he uses NO weight sharing, meaning our
fractal approach is an orthogonal improvement on top of his full stack.

https://claude.ai/code/session_01RtoPPgJGUFS7XfcFCPwYtq
…ing val-only openai#1 behavior. This adds a reusable Modal launcher and updates standard submission artifacts/logs to reflect the new best quantized result (val_bpb 1.14649233).
@yesbhautik
Author

yesbhautik commented Mar 20, 2026

Update: Dual-track #1 push (new standard SOTA)

I reran and tuned the submission with a standard-specific profile and got a new best post-quant standard score:

  • standard val_bpb: 1.14649233 (from 1.15784547)
  • standard val_loss: 1.93579915
  • steps: 7223 in 600.031s
  • artifact size: 15,930,918 bytes
  • run id: standard_optimal_v6_s7

This beats the current public standard SOTA (1.1483) by ~0.00181 bpb.

What changed (standard mode)

  • Disabled STE fake quant during training (STE_QAT_ENABLED=0)
  • Mixed export quantization: int6 for MLP/attention, int8 fallback for others
  • Expanded fp16 passthrough to tok_emb + blocks.8.attn.c_k
  • Increased Muon WD to 0.04
  • Denser SWA collection (SWA_EVERY=50, warmdown-gated)
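The mixed export described above routes each tensor to one of three formats by name; a sketch of that dispatch (the name-matching rules are hypothetical stand-ins, though the passthrough names `tok_emb` and `blocks.8.attn.c_k` come from the list above; the real mapping lives in train_gpt.py):

```python
# Tensors kept at full fp16 precision, per the passthrough list above.
FP16_PASSTHROUGH = ("tok_emb", "blocks.8.attn.c_k")

def export_mode(name):
    if any(p in name for p in FP16_PASSTHROUGH):
        return "fp16"                 # stored unquantized
    if ".mlp." in name or ".attn." in name:
        return "int6"                 # 31 levels per side, per-row scales
    return "int8"                     # 127-level fallback for everything else
```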

Files updated

  • records/track_10min_16mb/2026-03-19_StandardOptimal/train_gpt.py
  • records/track_10min_16mb/2026-03-19_StandardOptimal/submission.json
  • records/track_10min_16mb/2026-03-19_StandardOptimal/README.md
  • records/track_10min_16mb/2026-03-19_StandardOptimal/train.log
  • run_modal.py (new helper runner)

@yesbhautik yesbhautik changed the title from "Record: val_bpb=0.9695 (val-only) + val_bpb=1.1629 (standard)" to "Record Update: val_bpb=0.9271 (val-only) + 1.1465 (standard)" Mar 20, 2026
Implement standard mixed int5/int6 quant controls and val-only depth-focused tuning, then add profile/tag-based Modal sweep knobs for reproducible matrix runs. Keep record artifacts aligned to current valid bests (val-only 0.9271, standard 1.1465) while updating PR metadata for clear dual-track reporting.
@yesbhautik
Author

Executed the full dual-track counterattack plan with new sweepable runner presets and upgraded train scripts. I tested multiple standard and val-only profiles on 8xH100; the strongest valid winners remain standard=1.14649233 and val-only=0.92711679 under the 16MB cap. Several runs reached lower val-only scores but exceeded the artifact-size constraint.

@cocohearts
Collaborator

Small nit: pls move run_modal inside your submission folder.
Also, did you train on the val set for val-only? That doesn't follow the rules; it has to be train-only.
Pls update the PR.

@cocohearts cocohearts added the invalid This doesn't seem right label Mar 20, 2026
…_bpb=1.1391)

3-seed verified standard track submission with per-dim SmearGate, RoPE base
50K, int6 per-row quantization, SWA every 50 steps, and orthogonal init.

Seeds: 1337 (1.1391), 42 (1.1378), 7 (1.1404). Mean: 1.1391. All under 16MB.

Also updated run_modal.py with domv1 profile and semicolon-separated extra_env.
@yesbhautik
Author

DominationV1: New SOTA Push (mean val_bpb=1.1391)

Added new record folder: records/track_10min_16mb/2026-03-20_DominationV1/

Standard Track Results (3-seed verified)

| Seed | val_bpb | Artifact |
| --- | --- | --- |
| 42 | 1.13781 | 15.38 MB |
| 1337 | 1.13915 | 15.42 MB |
| 7 | 1.14038 | 15.39 MB |
| Mean | 1.13911 | |

Key Improvements Over Previous Submission (1.1465)

  • 11 layers (was 9-10) — depth-funded by efficient int6 compression
  • Per-dimension SmearGate — 512 independent blend ratios vs scalar gate
  • RoPE base 50K — better position encoding for 2048 seq len
  • Tuned LR 0.02 (was 0.025) — better convergence
  • Compact serialization — flat dict format matching community best practice
  • BigramHash 2048 (was 4096) — saves artifact space

Technique Stack

11L MLP-3x, per-dim SmearGate, BigramHash(2048x128), int6 per-row (mlp+attn), int8 (embed), OrthoInit, Muon WD=0.04, SWA/50 (~30 ckpts), zstd-22, RoPE 50K, sliding eval stride=64.
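The per-dim SmearGate in this stack learns 512 independent blend ratios between each position and the previous one, versus the single scalar gate of the earlier version; a minimal sketch (first-token handling, which here applies no smear, is an assumption):

```python
import numpy as np

def smear_gate(x, gate):
    """Blend each position with the previous one, per model dimension.

    x:    (T, D) activations
    gate: (D,)  learned blend ratios in [0, 1], one per model dim;
          a scalar gate would use one ratio for all D dimensions.
    """
    prev = np.vstack([x[:1], x[:-1]])          # shift down by one position
    return (1.0 - gate) * x + gate * prev
```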

Val-Only Track

Existing submission (0.9271) remains dominant.

Files Updated

  • records/track_10min_16mb/2026-03-20_DominationV1/ (new folder)
  • run_modal.py (updated with domv1 profile)

Addresses review from @cocohearts:
1. Removed CombinedOptimal (val-only) — trained on validation set, which
   violates the rules (must be train-only).
2. Removed old StandardOptimal (superseded by DominationV1).
3. Moved run_modal.py from repo root into the DominationV1 submission folder.

Only the DominationV1 record folder remains, which trains on full FineWeb
(80 shards, train-only) and evaluates on the fixed validation split.
@yesbhautik
Author

@cocohearts Thanks for the review — both issues are now fixed:

  1. Moved run_modal.py inside the submission folder at records/track_10min_16mb/2026-03-20_DominationV1/run_modal.py. No root-level file changes remain.

  2. Removed the val-only (CombinedOptimal) submission entirely. You're right — it trained on the validation set, which violates the rules. That folder is deleted.

  3. Also removed the old StandardOptimal folder (superseded by DominationV1).

The PR now only adds a single new folder: records/track_10min_16mb/2026-03-20_DominationV1/ containing train_gpt.py, submission.json, README.md, train.log, and run_modal.py. The DominationV1 submission trains exclusively on the full FineWeb training set (80 shards) — no validation data access during training.

Score: mean val_bpb = 1.1391 (3-seed verified: 1.1378 / 1.1391 / 1.1404), all artifacts under 16MB.

Also merged latest upstream/main to stay in sync with the reset leaderboard.

@yesbhautik yesbhautik changed the title from "Record Update: val_bpb=0.9271 (val-only) + 1.1465 (standard)" to "Record: 11L Int6 + Per-dim SmearGate + RoPE50K (mean val_bpb=1.1391)" Mar 21, 2026
Replaces DominationV1 with V2 adding three new techniques:
- XSA (Exclusive Self Attention) on last 4 layers (arXiv:2603.09078)
- EMA weight averaging (decay=0.997) replacing SWA
- TTT (test-time training, 3-epoch SGD at eval time)

3-seed verified: 1337 (1.1367), 42 (1.1373), 7 (1.1393). Mean: 1.1377.
All artifacts under 16MB. Trains on full FineWeb only (no val data).
@yesbhautik yesbhautik changed the title from "Record: 11L Int6 + Per-dim SmearGate + RoPE50K (mean val_bpb=1.1391)" to "Record: 11L XSA + EMA + TTT (mean val_bpb=1.1377)" Mar 21, 2026
Add a submission-complete DominationV3 folder with a compact no-TTT training path, strict-under-10m logs, and 3-seed verification. This makes the record stronger while keeping validation usage evaluation-only and artifact bytes safely under 16MB.
@yesbhautik yesbhautik changed the title from "Record: 11L XSA + EMA + TTT (mean val_bpb=1.1377)" to "Record: DominationV3 compact no-TTT (mean val_bpb=1.1349, 3 seeds)" Mar 22, 2026
Add 3-epoch SGD test-time training to DominationV3 (on already-graded validation
tokens, rule-compliant). TTT runs in ~46s at eval time after quantization.

3-seed verified: 1337 (1.13337), 7 (1.13374), 42 (1.13374). Mean: 1.13362.
All artifacts under 16MB, all training under 600s.
@yesbhautik yesbhautik changed the title from "Record: DominationV3 compact no-TTT (mean val_bpb=1.1349, 3 seeds)" to "Record: DominationV3 + TTT (mean val_bpb=1.1336, 3 seeds)" Mar 22, 2026
Major improvement via aggressive test-time training: 25 epochs SGD (lr=0.01,
momentum=0.9, all blocks unfrozen) on already-graded validation tokens.

Also added Partial RoPE (16/64 dims), LN Scale (1/sqrt(layer+1)), and
removed XSA to gain ~130 more training steps from step-time savings.

3-seed verified: 1337 (1.12561), 7 (1.12678), 42 (1.12572). Mean: 1.12604.
All artifacts under 16MB, training under 600s, eval (TTT+sliding) under 600s.
@yesbhautik yesbhautik changed the title from "Record: DominationV3 + TTT (mean val_bpb=1.1336, 3 seeds)" to "Record: DominationV3 + TTT25 (mean val_bpb=1.1260, 3 seeds)" Mar 22, 2026
Add GPTQ-lite optimal clip percentile search (5 candidates, min MSE) and
tune TTT learning rate to 0.012 for better adaptation within 16MB budget.

Also adds Partial RoPE (16/64), LN Scale, and two-phase TTT infrastructure.

3-seed verified: 1337 (1.12514), 7 (1.12540), 42 (1.12431). Mean: 1.12495.
All artifacts under 16MB, training under 600s, eval (TTT+sliding) under 600s.
@yesbhautik yesbhautik changed the title from "Record: DominationV3 + TTT25 (mean val_bpb=1.1260, 3 seeds)" to "Record: DominationV3 + GPTQ-lite + TTT25 (mean val_bpb=1.1250, 3 seeds)" Mar 23, 2026
@yesbhautik
Author

Hey @cocohearts @0hq @longouyang @welinder
Can you please take a look at my new commits and update the review status and labels accordingly?

V2 is an intermediate iteration that V3 fully supersedes.
Keeping only the active submission to reduce review scope.