
Record: DominationV3 + GPTQ-lite + TTT25 (mean val_bpb=1.1250, 3 seeds)#64

Open
yesbhautik wants to merge 13 commits into openai:main from yesbhautik:main

Conversation

@yesbhautik

@yesbhautik yesbhautik commented Mar 19, 2026

Summary

  • Standard-track submission: records/track_10min_16mb/2026-03-21_DominationV3/
  • GPTQ-lite optimal clip percentile search during int6 quantization
  • 25-epoch aggressive SGD TTT (lr=0.012, all blocks unfrozen) on already-graded tokens
  • Partial RoPE (16/64 dims) + LN Scale + XSA removed for more training steps
  • 3-seed verified, all under 600s training, eval under 600s, and under 16MB
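For reviewers, the GPTQ-lite clip search above can be sketched roughly as follows (a minimal NumPy sketch; the symmetric int6 range of [-31, 31] matches the technique stack, but the candidate percentiles here are illustrative, not the submission's actual values):

```python
import numpy as np

def quantize_row(row, clip, levels=31):
    # Symmetric int6: clip to +/-clip, map to [-levels, levels], round, dequantize.
    scale = clip / levels if clip > 0 else 1.0
    q = np.clip(np.round(row / scale), -levels, levels)
    return q * scale

def best_clip_per_row(w, percentiles=(99.0, 99.5, 99.9, 99.99, 100.0)):
    # For each weight row, try 5 candidate clip percentiles and keep the
    # reconstruction with the lowest MSE against the original row.
    out = np.empty_like(w)
    for i, row in enumerate(w):
        abs_row = np.abs(row)
        best_mse, best_rec = np.inf, row
        for p in percentiles:
            rec = quantize_row(row, np.percentile(abs_row, p))
            mse = float(np.mean((rec - row) ** 2))
            if mse < best_mse:
                best_mse, best_rec = mse, rec
        out[i] = best_rec
    return out
```

The per-row search trades a small amount of export time for a lower quantization error than a fixed max-abs scale, since outlier weights no longer dictate the scale for the whole row.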

Results (8xH100 SXM)

| Seed | val_bpb | train_time_ms | TTT_time_ms | total_artifact_bytes |
| --- | --- | --- | --- | --- |
| 1337 | 1.12513674 | 599779 | 389133 | 15965664 |
| 7 | 1.12540132 | 599841 | ~389000 | 15829190 |
| 42 | 1.12431423 | 599822 | ~389000 | 15806256 |
| Mean | 1.12495076 | | | |

Technique Stack

  • 11 layers, 512 dim, 8 heads, 4 KV heads (GQA), MLP 3x (1536 hidden)
  • Partial RoPE (16/64 dims), LN Scale (1/sqrt(layer+1))
  • EMA (decay=0.997)
  • GPTQ-lite: Per-row optimal clip percentile (5 candidates) minimizing reconstruction MSE
  • 25-epoch SGD TTT (lr=0.012, momentum=0.9, all blocks unfrozen)
  • Per-dim SmearGate + BigramHash(4096x128)
  • Int6 per-row quantization (mlp+attn+tok_emb) + zstd-22
  • Muon optimizer WD=0.04, momentum 0.99, OrthoInit, U-Net skip connections
  • Sliding window eval stride=64
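The EMA entry in the stack above keeps a shadow copy of the parameters that is blended toward the live weights each step; a minimal sketch (the parameter-dict layout is hypothetical, not the submission's actual state format):

```python
import numpy as np

class EMA:
    """Shadow copy of parameters, blended toward the live weights each step."""

    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = {name: p.copy() for name, p in params.items()}

    def update(self, params):
        d = self.decay
        for name, p in params.items():
            # shadow <- decay * shadow + (1 - decay) * live
            self.shadow[name] = d * self.shadow[name] + (1.0 - d) * p
```

At export time it would be the shadow weights, not the final raw step, that get int6-quantized, which is how the EMA and quantization entries in the stack compose.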

Rule Compliance

  • Trains only on FineWeb train shards (80 shards, no validation data during training)
  • TTT runs at eval time on already-graded validation tokens (per FAQ rules)
  • Training capped to 599.8s (all seeds stop under 600s)
  • TTT (~389s) + sliding eval (~197s) = ~586s total eval (under 10-minute eval limit)
  • Artifact = compressed_model_bytes + code_bytes, all under 16,000,000
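The TTT compliance claim above hinges on ordering: each validation chunk is graded before the model adapts on it. A toy sketch of that protocol (`score_fn` and `update_fn` are hypothetical stand-ins for the real loss and SGD step):

```python
import numpy as np

def eval_with_ttt(chunks, score_fn, update_fn, params):
    """Grade each chunk with the current params, THEN adapt on it.

    No chunk influences its own score: the update for chunk i runs only
    after chunk i's loss is recorded, so adaptation can only help the
    chunks that come after it.
    """
    losses = []
    for chunk in chunks:
        losses.append(score_fn(params, chunk))   # graded first
        params = update_fn(params, chunk)        # adapted afterwards
    return float(np.mean(losses)), params
```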

manfromnowhere143 added a commit to manfromnowhere143/parameter-golf that referenced this pull request Mar 19, 2026
Every submission scoring <1.18 BPB uses these EXACT settings.
We were running defaults — now matching the winners:

  MUON_MOMENTUM:       0.95 → 0.99 (stronger smoothing)
  MATRIX_LR:           0.04 → 0.02 (halved, reduces quant gap)
  SCALAR_LR:           0.04 → 0.02 (halved)
  TIED_EMBED_LR:       0.05 → 0.03 (halved)
  WARMDOWN_ITERS:      1200 → 3000 (longer warmdown)
  MUON_WARMUP_START:   0.85 → 0.92 (higher start)
  MUON_WARMUP_STEPS:   500  → 1500 (3x longer warmup)

These settings are proven by PR openai#64 (1.0149), openai#66 (1.1652),
openai#70 (1.1659), openai#65 (1.1808) — all top submissions.

Applied to both v5 and v6. Both compile, 1498 lines each.
@jordankzf jordankzf mentioned this pull request Mar 19, 2026
…x + STE int6 QAT

Upgrades CombinedOptimal to V2 (0.9695 BPB, from 1.0149) and adds StandardOptimal (1.1629 BPB).

New techniques: MLP 3x (h=1536), STE fake-int6 quantization-aware training,
true int6 per-row (31 levels) for blocks with int8 for embedding.
Both run on 8xH100 SXM within 10 minutes + sliding window eval.
@yesbhautik yesbhautik changed the title from "Record: Combined Optimal (val_bpb=1.0149) — 4 techniques stacked" to "Record: val_bpb=0.9695 (val-only) + val_bpb=1.1629 (standard)" Mar 19, 2026
@yesbhautik
Author

V2 Update

Major upgrade from the initial submission (1.0149 → 0.9695 BPB). Three new techniques added:

What changed:

  • MLP 3x expansion (h=1536): 50% wider feedforward, enabled by int6 quantization freeing artifact space
  • STE fake-int6 quantization-aware training: Weights are fake-quantized to [-31, 31] during the forward pass via Straight-Through Estimator. The model learns weight distributions that survive post-training int6 quantization, dropping the quant penalty from ~0.05 to ~0.001 BPB
  • True int6 per-row quantization (31 levels) on block weights, with int8 (127 levels) preserved for the embedding which lacks STE protection
  • Architecture adjusted from 10 layers → 9 layers to fit MLP 3x under 16MB
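The STE pass described above can be sketched in PyTorch as the standard detach trick (per-row max-abs scaling here is an assumption about the fake-quant scale; the described technique is the [-31, 31] rounding with straight-through gradients):

```python
import torch

def fake_int6(w: torch.Tensor, levels: int = 31) -> torch.Tensor:
    """Fake-quantize to 2*levels + 1 symmetric levels with an STE backward.

    The forward pass sees the quantized weights; the detach trick makes
    the backward pass treat quantization as the identity, so gradients
    flow to the underlying fp weights unchanged.
    """
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / levels
    w_q = torch.round(w / scale).clamp(-levels, levels) * scale
    return w + (w_q - w).detach()
```

Because the model only ever trains against weights it will actually have after export, the post-training int6 step becomes nearly free, which is the ~0.05 to ~0.001 BPB quant-penalty drop claimed above.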

Also added: StandardOptimal submission trained on full FineWeb (no val-only), achieving val_bpb=1.1629

Both submissions share the same train_gpt.py and run within 10 minutes on 8xH100 SXM + ~5 min sliding window eval.
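The sliding window eval mentioned here re-scores the sequence in overlapping windows, scoring only the trailing stride tokens of each window so nearly every token sees full left context. A sketch of the span bookkeeping (the 64-token stride is the submission's; the window size is a free parameter here):

```python
def sliding_eval_spans(n_tokens, window=2048, stride=64):
    """Plan (start, end, score_from) spans for sliding-window eval.

    Tokens [score_from, end) of each span are scored. After the first
    window only the trailing `stride` tokens are scored, so every
    scored token sees close to `window` tokens of left context.
    """
    end = min(window, n_tokens)
    spans = [(0, end, 0)]                 # first window scores everything
    while end < n_tokens:
        prev_end = end
        end = min(end + stride, n_tokens)
        spans.append((end - window, end, prev_end))
    return spans
```

A smaller stride costs proportionally more forward passes (window/stride per token position), which is why the eval budget matters as much as the training budget.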

newjordan referenced this pull request in newjordan/parameter-golf Mar 20, 2026
His 0.9695 BPB is val-only training (separate track). Standard score is
1.1629, close to Larson's 1.1574. No novel architecture — wins with
tuning: stride-64 sliding window, seq_len=4096, mixed int6/int8 quant,
Muon momentum=0.99. Crucially, he uses NO weight sharing, meaning our
fractal approach is an orthogonal improvement on top of his full stack.

https://claude.ai/code/session_01RtoPPgJGUFS7XfcFCPwYtq
…ing val-only openai#1 behavior. This adds a reusable Modal launcher and updates standard submission artifacts/logs to reflect the new best quantized result (val_bpb 1.14649233).
@yesbhautik
Author

yesbhautik commented Mar 20, 2026

Update: Dual-track #1 push (new standard SOTA)

I reran and tuned the submission with a standard-specific profile and got a new best post-quant standard score:

  • standard val_bpb: 1.14649233 (from 1.15784547)
  • standard val_loss: 1.93579915
  • steps: 7223 in 600.031s
  • artifact size: 15,930,918 bytes
  • run id: standard_optimal_v6_s7

This beats the current public standard SOTA (1.1483) by ~0.00181 bpb.

What changed (standard mode)

  • Disabled STE fake quant during training (STE_QAT_ENABLED=0)
  • Mixed export quantization: int6 for MLP/attention, int8 fallback for others
  • Expanded fp16 passthrough to tok_emb + blocks.8.attn.c_k
  • Increased Muon WD to 0.04
  • Denser SWA collection (SWA_EVERY=50, warmdown-gated)
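The mixed export described above routes each tensor to one of three formats by name; a sketch of that dispatch (the name-matching rules are hypothetical stand-ins, though the passthrough names `tok_emb` and `blocks.8.attn.c_k` come from the list above; the real mapping lives in train_gpt.py):

```python
# Tensors kept at full fp16 precision, per the passthrough list above.
FP16_PASSTHROUGH = ("tok_emb", "blocks.8.attn.c_k")

def export_mode(name):
    if any(p in name for p in FP16_PASSTHROUGH):
        return "fp16"                 # stored unquantized
    if ".mlp." in name or ".attn." in name:
        return "int6"                 # 31 levels per side, per-row scales
    return "int8"                     # 127-level fallback for everything else
```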

Files updated

  • records/track_10min_16mb/2026-03-19_StandardOptimal/train_gpt.py
  • records/track_10min_16mb/2026-03-19_StandardOptimal/submission.json
  • records/track_10min_16mb/2026-03-19_StandardOptimal/README.md
  • records/track_10min_16mb/2026-03-19_StandardOptimal/train.log
  • run_modal.py (new helper runner)

@yesbhautik yesbhautik changed the title from "Record: val_bpb=0.9695 (val-only) + val_bpb=1.1629 (standard)" to "Record Update: val_bpb=0.9271 (val-only) + 1.1465 (standard)" Mar 20, 2026
Implement standard mixed int5/int6 quant controls and val-only depth-focused tuning, then add profile/tag-based Modal sweep knobs for reproducible matrix runs. Keep record artifacts aligned to current valid bests (val-only 0.9271, standard 1.1465) while updating PR metadata for clear dual-track reporting.
@yesbhautik
Author

Executed the full dual-track counterattack plan with new sweepable runner presets and upgraded train scripts. I tested multiple standard and val-only profiles on 8xH100; the strongest valid winners remain standard=1.14649233 and val-only=0.92711679 under the 16MB cap. Several runs reached lower val-only scores but exceeded the artifact-size constraint.

@cocohearts
Collaborator

Small nit: pls move run_modal inside your submission folder.
Also, did you train on the val set for val-only? That doesn't follow the rules; it has to be train-only.
Pls update the PR.

@cocohearts cocohearts added the invalid This doesn't seem right label Mar 20, 2026
…_bpb=1.1391)

3-seed verified standard track submission with per-dim SmearGate, RoPE base
50K, int6 per-row quantization, SWA every 50 steps, and orthogonal init.

Seeds: 1337 (1.1391), 42 (1.1378), 7 (1.1404). Mean: 1.1391. All under 16MB.

Also updated run_modal.py with domv1 profile and semicolon-separated extra_env.
@yesbhautik
Author

DominationV1: New SOTA Push (mean val_bpb=1.1391)

Added new record folder: records/track_10min_16mb/2026-03-20_DominationV1/

Standard Track Results (3-seed verified)

| Seed | val_bpb | Artifact |
| --- | --- | --- |
| 42 | 1.13781 | 15.38 MB |
| 1337 | 1.13915 | 15.42 MB |
| 7 | 1.14038 | 15.39 MB |
| Mean | 1.13911 | |

Key Improvements Over Previous Submission (1.1465)

  • 11 layers (was 9-10) — depth-funded by efficient int6 compression
  • Per-dimension SmearGate — 512 independent blend ratios vs scalar gate
  • RoPE base 50K — better position encoding for 2048 seq len
  • Tuned LR 0.02 (was 0.025) — better convergence
  • Compact serialization — flat dict format matching community best practice
  • BigramHash 2048 (was 4096) — saves artifact space

Technique Stack

11L MLP-3x, per-dim SmearGate, BigramHash(2048x128), int6 per-row (mlp+attn), int8 (embed), OrthoInit, Muon WD=0.04, SWA/50 (~30 ckpts), zstd-22, RoPE 50K, sliding eval stride=64.
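The per-dim SmearGate in this stack learns 512 independent blend ratios between each position and the previous one, versus the single scalar gate of the earlier version; a minimal sketch (first-token handling, which here applies no smear, is an assumption):

```python
import numpy as np

def smear_gate(x, gate):
    """Blend each position with the previous one, per model dimension.

    x:    (T, D) activations
    gate: (D,)  learned blend ratios in [0, 1], one per model dim;
          a scalar gate would use one ratio for all D dimensions.
    """
    prev = np.vstack([x[:1], x[:-1]])          # shift down by one position
    return (1.0 - gate) * x + gate * prev
```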

Val-Only Track

Existing submission (0.9271) remains dominant.

Files Updated

  • records/track_10min_16mb/2026-03-20_DominationV1/ (new folder)
  • run_modal.py (updated with domv1 profile)

Addresses review from @cocohearts:
1. Removed CombinedOptimal (val-only) — trained on validation set, which
   violates the rules (must be train-only).
2. Removed old StandardOptimal (superseded by DominationV1).
3. Moved run_modal.py from repo root into the DominationV1 submission folder.

Only the DominationV1 record folder remains, which trains on full FineWeb
(80 shards, train-only) and evaluates on the fixed validation split.
@yesbhautik
Author

@cocohearts Thanks for the review — both issues are now fixed:

  1. Moved run_modal.py inside the submission folder at records/track_10min_16mb/2026-03-20_DominationV1/run_modal.py. No root-level file changes remain.

  2. Removed the val-only (CombinedOptimal) submission entirely. You're right — it trained on the validation set, which violates the rules. That folder is deleted.

  3. Also removed the old StandardOptimal folder (superseded by DominationV1).

The PR now only adds a single new folder: records/track_10min_16mb/2026-03-20_DominationV1/ containing train_gpt.py, submission.json, README.md, train.log, and run_modal.py. The DominationV1 submission trains exclusively on the full FineWeb training set (80 shards) — no validation data access during training.

Score: mean val_bpb = 1.1391 (3-seed verified: 1.1378 / 1.1391 / 1.1404), all artifacts under 16MB.

Also merged latest upstream/main to stay in sync with the reset leaderboard.

@yesbhautik yesbhautik changed the title from "Record Update: val_bpb=0.9271 (val-only) + 1.1465 (standard)" to "Record: 11L Int6 + Per-dim SmearGate + RoPE50K (mean val_bpb=1.1391)" Mar 21, 2026
Replaces DominationV1 with V2 adding three new techniques:
- XSA (Exclusive Self Attention) on last 4 layers (arXiv:2603.09078)
- EMA weight averaging (decay=0.997) replacing SWA
- TTT (test-time training, 3-epoch SGD at eval time)

3-seed verified: 1337 (1.1367), 42 (1.1373), 7 (1.1393). Mean: 1.1377.
All artifacts under 16MB. Trains on full FineWeb only (no val data).
@yesbhautik yesbhautik changed the title from "Record: 11L Int6 + Per-dim SmearGate + RoPE50K (mean val_bpb=1.1391)" to "Record: 11L XSA + EMA + TTT (mean val_bpb=1.1377)" Mar 21, 2026
Add a submission-complete DominationV3 folder with a compact no-TTT training path, strict-under-10m logs, and 3-seed verification. This makes the record stronger while keeping validation usage evaluation-only and artifact bytes safely under 16MB.
@yesbhautik yesbhautik changed the title from "Record: 11L XSA + EMA + TTT (mean val_bpb=1.1377)" to "Record: DominationV3 compact no-TTT (mean val_bpb=1.1349, 3 seeds)" Mar 22, 2026
Add 3-epoch SGD test-time training to DominationV3 (on already-graded validation
tokens, rule-compliant). TTT runs in ~46s at eval time after quantization.

3-seed verified: 1337 (1.13337), 7 (1.13374), 42 (1.13374). Mean: 1.13362.
All artifacts under 16MB, all training under 600s.
@yesbhautik yesbhautik changed the title from "Record: DominationV3 compact no-TTT (mean val_bpb=1.1349, 3 seeds)" to "Record: DominationV3 + TTT (mean val_bpb=1.1336, 3 seeds)" Mar 22, 2026
Major improvement via aggressive test-time training: 25 epochs SGD (lr=0.01,
momentum=0.9, all blocks unfrozen) on already-graded validation tokens.

Also added Partial RoPE (16/64 dims), LN Scale (1/sqrt(layer+1)), and
removed XSA to gain ~130 more training steps from step-time savings.

3-seed verified: 1337 (1.12561), 7 (1.12678), 42 (1.12572). Mean: 1.12604.
All artifacts under 16MB, training under 600s, eval (TTT+sliding) under 600s.
@yesbhautik yesbhautik changed the title from "Record: DominationV3 + TTT (mean val_bpb=1.1336, 3 seeds)" to "Record: DominationV3 + TTT25 (mean val_bpb=1.1260, 3 seeds)" Mar 22, 2026
Add GPTQ-lite optimal clip percentile search (5 candidates, min MSE) and
tune TTT learning rate to 0.012 for better adaptation within 16MB budget.

Also adds Partial RoPE (16/64), LN Scale, and two-phase TTT infrastructure.

3-seed verified: 1337 (1.12514), 7 (1.12540), 42 (1.12431). Mean: 1.12495.
All artifacts under 16MB, training under 600s, eval (TTT+sliding) under 600s.
@yesbhautik yesbhautik changed the title from "Record: DominationV3 + TTT25 (mean val_bpb=1.1260, 3 seeds)" to "Record: DominationV3 + GPTQ-lite + TTT25 (mean val_bpb=1.1250, 3 seeds)" Mar 23, 2026
@yesbhautik
Author

Hey @cocohearts @0hq @longouyang @welinder
Can you please take a look at my new commits and update the review status and labels accordingly?

V2 is an intermediate iteration that V3 fully supersedes.
Keeping only the active submission to reduce review scope.