
Codex/start 10b token export run #2

Closed
0hq wants to merge 30 commits into main from codex/start-10b-token-export-run

Conversation


@0hq 0hq commented Mar 18, 2026

No description provided.

@0hq 0hq closed this Mar 18, 2026
@0hq 0hq deleted the codex/start-10b-token-export-run branch March 18, 2026 16:32
newjordan referenced this pull request in newjordan/parameter-golf Mar 20, 2026
Bavadiya leads with 0.9695 BPB (▼0.2549 vs baseline), far ahead of
Sam Larson's #2 at 1.1574. The 0.19 BPB gap suggests techniques beyond
the known int6+MLP3x+sliding-window formula.

https://claude.ai/code/session_01RtoPPgJGUFS7XfcFCPwYtq
pleasedontddosme added a commit to pleasedontddosme/parameter-golf that referenced this pull request Mar 20, 2026
Combines best techniques from WarmdownQuantization (openai#1) and SlidingWindow (openai#2):
- Int6 quant, FP16 tied embeddings, Late-K passthrough
- Batched sliding window eval (stride=64), overtone init, phase-transition resid_mix
- Muon decoupled weight decay, AdamW for embeddings/scalars
- Novel: QAT with STE in last 30% of training for near-zero quant penalty
- Cosine warmdown schedule, higher Muon momentum warmup

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
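The "QAT with STE" line above refers to fake-quantizing weights during the last stretch of training so the network adapts to quantization error before export. A minimal numpy sketch of symmetric int6 fake-quantization (forward pass only; in training, the straight-through estimator passes gradients through this op unchanged; all names and the clip value are illustrative):

```python
import numpy as np

def fake_quant_int6(w, clip):
    """Symmetric int6 fake-quantization: round to 63 levels in [-clip, clip],
    then dequantize. Under QAT, the straight-through estimator (STE) treats
    this op as the identity in the backward pass."""
    qmax = 31                                   # symmetric int6 range: -31..31
    scale = clip / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

w = np.array([0.05, -0.12, 0.9, -1.5])
wq = fake_quant_int6(w, clip=1.0)               # -1.5 saturates at the clip
```

In PyTorch the same op is typically written as `w + (fake_quant(w) - w).detach()` so autograd sees the identity.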
keshav55 added a commit to keshav55/parameter-golf that referenced this pull request Mar 20, 2026


Novel techniques from the top 2 leaderboard entries:

1. BigramHash (BIGRAM_BUCKETS=4096, BIGRAM_DIM=128):
   - Hash consecutive token pairs → embedding lookup → project to model_dim
   - XOR with coprime multipliers for hash function
   - Captures local bigram context (~524K params for 4096 buckets)
   - Used by openai#1 (thwu1, 1.1428 BPB) and openai#2 (Raahil Shah, 1.1458 BPB)

2. SmearGate (SMEAR_GATE=1):
   - Learned per-dim gate blending current token with previous token
   - Applied after embedding normalization
   - Only ~512 params
   - Used by openai#2 and openai#4

Both are env-var controlled (0=disabled by default).
run_v7_full.sh enables everything for the full stack.

Also fixed: BigramHash/SmearGate params added to optimizer groups.
1438 lines (62 under 1500 limit).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
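The two components above can be sketched in a few lines of numpy. The hash constants, MODEL_DIM, and all function names are illustrative (the submissions' exact values are not in the message); BUCKETS and BDIM are the BIGRAM_BUCKETS/BIGRAM_DIM values quoted above:

```python
import numpy as np

BUCKETS, BDIM, MODEL_DIM = 4096, 128, 512
rng = np.random.default_rng(0)
bigram_emb = rng.standard_normal((BUCKETS, BDIM)) * 0.02   # hashed-bigram table
proj = rng.standard_normal((BDIM, MODEL_DIM)) * 0.02       # projection to model_dim

def bigram_bucket(prev_tok, cur_tok):
    # XOR of the token pair after multiplying by coprime constants,
    # folded into the bucket range (constants here are made up).
    return ((prev_tok * 2654435761) ^ (cur_tok * 40503)) % BUCKETS

def bigram_features(tokens):
    """One hashed-bigram embedding per position, projected to model_dim."""
    toks = np.asarray(tokens)
    prev = np.concatenate([[0], toks[:-1]])    # position 0 gets a dummy predecessor
    return bigram_emb[bigram_bucket(prev, toks)] @ proj

def smear_gate(h, gate_logits):
    """SmearGate: learned per-dim sigmoid gate blending each position with
    the previous one; only model_dim (~512) parameters."""
    g = 1.0 / (1.0 + np.exp(-gate_logits))     # per-dim gate in (0, 1)
    h_prev = np.concatenate([h[:1], h[:-1]])   # shifted copy; row 0 unchanged
    return (1 - g) * h + g * h_prev

h = bigram_features([5, 17, 17, 99])
out = smear_gate(h, np.zeros(MODEL_DIM))
```

At 4096 buckets of 128 dims the table alone is 4096 x 128 = 524,288 parameters, matching the ~524K figure in the message.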
kasimte pushed a commit to kasimte/parameter-golf that referenced this pull request Mar 21, 2026
4 proven optimizations from merged leaderboard entries:
- train_seq_len 1024→2048 (both top entries use this)
- SWA every 50 steps, start_frac=0.4 (swept optimal by openai#2 author)
- grad_clip_norm 0.3 (both top entries use this)
- LRs 0.025/0.025/0.035 (from PR openai#287)

BigramHash stays at 2048 to avoid artifact size risk.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
evnkm added a commit to evnkm/parameter-golf that referenced this pull request Mar 21, 2026
Rewrote train_gpt_shared.py with full SOTA stack from #1 leaderboard
submission (10L GPT, BigramHash, SmearGate, mixed int5/int6 quant,
SWA, Muon WD=0.04, magnitude pruning, zstd-22, sliding window eval).

Baseline result: val_bpb = 1.1438 (vs SOTA 1.1428) on 8xH100 in 600s.

Added two new ideas on top:
- TrigramHashEmbedding(4096 buckets, 32-dim): captures 3-token local
  patterns beyond bigram. Adds ~147K params (~60-80KB compressed).
- Progressive QAT (int5/int6 STE fake-quantize): applied from step 0
  via CastedLinear.qat_clip to avoid costly torch.compile recompile.

Experiment openai#2 (trigram + QAT at 70% wallclock) scored 1.1630 — worse
than baseline because the torch.compile recompile at activation cost
~130s (22% of 600s budget). Fixed by moving QAT to start of training.

Other changes:
- run_modal.py: migrated from deprecated modal.Mount to Image.add_local_dir,
  fixed sys.exit(0) traceback to raise RuntimeError only on failure.
- research/IDEAS.md: full research log with 11 ranked ideas and
  experiment tracking table.

Next: run openai#3 with QAT-from-start + trigram to test without recompile
penalty, then per-layer bitwidth search to squeeze more capacity into
the 16MB budget.

Made-with: Cursor
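Several entries above mention sliding-window eval with stride=64: each validation token is scored with a long left context, but full-window compute is paid only once per stride. A sketch of the span bookkeeping, assuming a 1024-token window (the window size and function name are illustrative; only the stride comes from the messages):

```python
def sliding_window_spans(n_tokens, window=1024, stride=64):
    """Yield (start, lo, hi) triples: score tokens [lo, hi) conditioned on
    tokens[start:hi]. Only the last `stride` positions of each window are
    scored, so every token after the first window sees at least
    window - stride tokens of left context."""
    spans = []
    lo = 0
    while lo < n_tokens:
        hi = min(lo + (window if lo == 0 else stride), n_tokens)
        start = max(0, hi - window)
        spans.append((start, lo, hi))
        lo = hi
    return spans

spans = sliding_window_spans(1200)
```

The scored ranges [lo, hi) partition the token stream, so summing per-span losses gives the exact corpus BPB.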
newjordan referenced this pull request in newjordan/parameter-golf Mar 22, 2026
Copy of pr374_safe — EMA(0.997) + Tight SWA + QAT(0.15) + warmdown(3500).
3-seed mean 1.1248, best seed 1.1243. #2 overall, #1 non-TTT on the leaderboard.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
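EMA(0.997) here means maintaining an exponential moving average of the weights alongside training and evaluating (and exporting) the averaged copy. A minimal sketch, with the decay value taken from the message:

```python
import numpy as np

def ema_update(ema_params, params, decay=0.997):
    """After each optimizer step: ema <- decay * ema + (1 - decay) * current.
    Evaluation uses ema_params instead of the raw weights."""
    for k in ema_params:
        ema_params[k] = decay * ema_params[k] + (1 - decay) * params[k]

params = {"w": np.array([1.0])}
ema = {"w": np.array([0.0])}
for _ in range(3):
    ema_update(ema, params)   # ema drifts toward params at rate 1 - decay
```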
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 23, 2026
Copy of pr374_safe — EMA(0.997) + Tight SWA + QAT(0.15) + warmdown(3500).
3-seed mean 1.1248, best seed 1.1243. openai#2 overall, openai#1 non-TTT on the leaderboard.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ifrederico added a commit to ifrederico/parameter-golf that referenced this pull request Mar 25, 2026
Replace baseline train_gpt.py with the openai#2 leaderboard submission code,
stacking all proven techniques: 11L architecture, INT6 QAT with late
activation, GPTQ-lite clip search, XSA on last 4 layers, EMA (0.997),
BigramHash(2048), SmearGate, Partial RoPE (16/64), LN Scale,
ValueEmbedding, sliding-window eval (stride=64), zstd-22 compression,
warmdown=3500, Muon momentum→0.99, WD=0.04.

Add manifest writing for harness integration. Update harness to make
manifest optional. Refresh first-wave experiments and tests for new API.
gthgomez added a commit to gthgomez/parameter-golf that referenced this pull request Mar 25, 2026
records/track_non_record_16mb/2026-03-25_LocalAblation_GTX1650_EMA_Int6_PartialRoPE/

Dev-hardware (GTX 1650, SM 7.5, 4 GB VRAM, Windows 11) pipeline porting
proven techniques from leaderboard entries openai#1 and openai#2 via 200-step local
ablation runs. Features implemented and validated:

- NO_COMPILE + math SDP fallback + MAX_VAL_SEQS (GTX 1650 compat, inert on H100)
- EMA (decay sweep: 0.997 for competition, 0.97 validated locally)
- int6 clip-search quantizer + in-process A/B comparison
- Partial RoPE (ROPE_DIMS=16) + LN Scale 1/sqrt(layer+1)
- Muon decoupled weight decay (MUON_WD) + AdamW for tok/scalar
- MLP_MULT float support (enables MLP_MULT=3.0)

Best local result: val_bpb 2.5273 (int8 roundtrip, combined config, 200 steps)
Pending: full 11L competition run on 8xH100 with seq_len=2048

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
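"Muon decoupled weight decay" above means the decay shrinks the weights directly, separate from Muon's momentum/orthogonalization update, rather than being added to the gradient as an L2 term (the AdamW-style formulation). A one-line sketch of the decoupled step (illustrative; the WD value is the MUON_WD=0.04 quoted in these entries):

```python
def decoupled_weight_decay(params, lr, wd=0.04):
    """Applied once per step, after (and independent of) whatever
    gradient-based update the optimizer computes."""
    for k in params:
        params[k] *= (1.0 - lr * wd)

p = {"w": 2.0}
decoupled_weight_decay(p, lr=0.5)   # 2.0 * (1 - 0.5 * 0.04) = 1.96
```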
gthgomez added a commit to gthgomez/parameter-golf that referenced this pull request Mar 25, 2026
records/track_non_record_16mb/2026-03-25_LocalAblation_GTX1650_EMA_Int6_PartialRoPE/

Dev-hardware (GTX 1650, SM 7.5, 4 GB VRAM, Windows 11) pipeline porting
proven techniques from leaderboard entries openai#1 and openai#2 via 200-step local
ablation runs. Features implemented and validated:

- NO_COMPILE + math SDP fallback + MAX_VAL_SEQS (GTX 1650 compat, inert on H100)
- EMA (decay sweep: 0.997 for competition-scale, 0.97 validated locally)
- int6 clip-search quantizer + in-process A/B comparison
- Partial RoPE (ROPE_DIMS=16) + LN Scale 1/sqrt(layer+1)
- Muon decoupled weight decay (MUON_WD) + AdamW for tok/scalar
- MLP_MULT float support (enables MLP_MULT=3.0)

Best local result: val_bpb 2.5273 (int8 roundtrip, combined config, 200 steps)
Not a leaderboard attempt. Pending: full 11L competition run on 8xH100.
adityamhn added a commit to adityamhn/parameter-golf that referenced this pull request Mar 28, 2026
- Warmdown 1200 → 3500 (proven by both our research and openai#2 leaderboard entry)
- Muon weight decay WD=0.04 (proven at both Tier 1 and Tier 2 scales)
- Adam embedding weight decay WD=0.01 (proven to stack with Muon WD)
- LeakyReLU(0.5) activation (used by openai#1 leaderboard entry)

Made-with: Cursor
wsylvest added a commit to wsylvest/parameter-golf that referenced this pull request Mar 30, 2026
Replace single fixed clip percentile (99.99984) with per-row optimal
clip search across 5 percentiles [0.999, 0.9995, 0.9999, 0.99999, 1.0].
Each row uses the percentile giving minimum reconstruction MSE.
Deterministic, zero training cost. Used by openai#2 submission (est. -0.003 q_gap).
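The per-row clip search above can be sketched directly: for each weight row, try every candidate percentile as the clip point, quantize, and keep the one with minimum reconstruction MSE. The percentile list is from the commit; the int6 range (qmax=31) and function name are assumptions:

```python
import numpy as np

def per_row_clip_search(W, percentiles=(0.999, 0.9995, 0.9999, 0.99999, 1.0),
                        qmax=31):
    """Per-row optimal clip: each row uses the percentile whose symmetric
    int quantization gives minimum reconstruction MSE. Deterministic,
    needs no training."""
    out = np.empty_like(W)
    for i, row in enumerate(W):
        best_mse, best = np.inf, row
        for p in percentiles:
            clip = np.quantile(np.abs(row), p)
            if clip == 0:
                continue
            scale = clip / qmax
            rq = np.clip(np.round(row / scale), -qmax, qmax) * scale
            mse = np.mean((row - rq) ** 2)
            if mse < best_mse:
                best_mse, best = mse, rq
        out[i] = best
    return out
```

Because p=1.0 (full range) is one of the candidates, the search can never reconstruct worse than a fixed full-range clip; rows with outliers gain the most, since a tighter clip spends the quantization levels on the bulk of the distribution.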
jzmyres pushed a commit to jzmyres/parameter-golf that referenced this pull request Mar 31, 2026
1. RevDEQ: single shared block iterated 5 times (Constraint openai#1)
2. MLA with Gated Attention: low-rank KV, decoupled RoPE, sigmoid gates (Constraint openai#3)
3. Soft Dense Routing: router + per-expert sigmoid gate on MLP groups (Constraint openai#2)

All constraints satisfied. Model: 9.4M params. Fix DDP unused param issue.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jzmyres pushed a commit to jzmyres/parameter-golf that referenced this pull request Mar 31, 2026
- Replace split-dim experts (dim//E) with full-dim low-rank (dim→rank→dim)
  so every expert sees all dimensions through a rank bottleneck
- MoS routing: pure softmax convex combination (Mixtape paper),
  removed sigmoid gates (expert_gate_ctp/ntp_logits)
- Added configurable attn_expert_rank / mlp_expert_rank hyperparameters
- Added MoS eval diagnostics: usage/entropy/balance_cv for CTP+NTP
- Updated metrics plot: expert usage shows min/max/mean/median per component
  (Attn, MLP, MoS CTP, MoS NTP) for scalability
- Updated CLAUDE.md Constraint openai#2 with full-dim + Mixtape clarifications
- Result: val_bpb=1.4094, attn_cv 0.32→0.22, artifact 15.4MB

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>