Record: 11L + Tight SWA + Shared VE128 + Partial RoPE + LN Scale + XSA4 (val_bpb: 1.1246) by unnir · Pull Request #374 · openai/parameter-golf

unnir · 2026-03-21T22:00:01Z

Record: 11L + Tight SWA + Shared VE128 + Partial RoPE + LN Scale + XSA4 (val_bpb: 1.1246)

Key Innovation: Tight SWA

SWA checkpoint collection restricted to scale<0.2 (last ~600 steps), every 50 steps. This eliminates the SWA quality penalty (post-SWA BPB = pre-SWA BPB) while maintaining quantization-friendly weight averaging. Standard SWA (scale<0.5) averages stale checkpoints that hurt final quality.

Architecture

11 transformer layers, 512-dim, 8 heads (4 KV heads, GQA)
3x MLP expansion with relu-squared activation
Efficient Partial XSA on last 4 layers (GQA-aware, zero-alloc)
Partial RoPE (16/64 dims) + NTK-aware scaling
LN Scale Factor 1/sqrt(layer_idx+1)
U-Net skip connections (5 encoder, 6 decoder)
SmearGate + BigramHash (2048 buckets, dim=128)
Shared Value Embedding (dim=128, layers 9,10) — 1 table, per-layer learned scales
FlashAttention 3 (Hopper)
Orthogonal init with proj scaling by 1/sqrt(2*num_layers)
Logit softcap 30.0, tied embeddings

Training

Muon optimizer (matrices): lr=0.025, momentum=0.99 (warmup 0.92→0.99 over 1500 steps), WD=0.04
AdamW (embeddings): lr=0.035, (scalars): lr=0.025, WD=0.04
Gradient clip: 0.3
Batch: 786,432 tokens/step, seq_len=2048
Warmdown: 3000 iters (wallclock-based)
Tight SWA: every 50 steps when scale<0.2 (12 checkpoints from last 600 steps)
Late QAT: STE int6 fake-quantization when LR scale<0.1

Quantization

Int6 per-row for MLP + attention weights
Int8 per-row for embeddings
Control tensors in fp32
zstd level 22 compression

Results

6942 steps in 600s at 86.4ms/step
Pre-quant val_bpb: 1.1407
Post-SWA val_bpb: 1.1407 (zero SWA penalty!)
Quant gap: 0.008
Sliding window val_bpb: 1.1246
Artifact size: 15,706,024 bytes (15.71 MB)

Run

NUM_LAYERS=11 MLP_MULT=3.0 XSA_LAST_N=4 ROPE_DIMS=16 LN_SCALE=1 \
SWA_ENABLED=1 SWA_EVERY=50 LATE_QAT_THRESHOLD=0.1 \
VE_ENABLED=1 VE_DIM=128 VE_LAYERS=9,10 \
BIGRAM_VOCAB_SIZE=2048 BIGRAM_DIM=128 ADAM_WD=0.04 MUON_WD=0.04 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
torchrun --nproc_per_node=8 train_gpt.py

…A4 (val_bpb: 1.1246)

Made-with: Cursor

Three low-risk additions: - Memory Tokens (64 learnable embeddings, -0.014 A/B, PR openai#352) - Backout Connection (learned mid-layer subtraction, -0.007, PR openai#339) - Tight SWA (scale<0.2, every 50, replacing EMA. PR openai#374) Bugs found and fixed during review: - memory_tokens/backout_lambda not in optimizer groups (code review) - memory_tokens appended to embed_params AFTER optimizer creation (/refine) - Dead encoder-loop h_mid check removed Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

From arXiv:2603.09078. Projects out the self-value component from attention output, forcing the network to use contextual information. Applied via GQA-aware zero-alloc view reshape on last 4 of 11 layers. Both top unmerged submissions (PR openai#374 at 1.1246 and PR openai#379 at 1.1260) use XSA as a key technique. Full next-gen stack now includes: 11L, XSA, Partial RoPE 16/64, Late QAT STE, Tight SWA, GPTQ-lite, LN Scale, FA3, SmearGate, BigramHash, int6+zstd, Muon WD, OrthoInit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fork of unnir's openai#374 (1.1246 BPB) with TTT added: - 11L, XSA4, Partial RoPE 16/64, LN Scale, Tight SWA - Shared VE128, SmearGate, BigramHash 2048 - TTT: 25 epochs SGD on val data post-quantization - Trimmed to 1476 lines (under 1500 limit)

Two-phase TTT on PR openai#374 base: phase 1 norm-only recalibration (100ep Adam), phase 2 selective-freeze last 2 blocks (15ep SGD). Artifact 15.76MB.

Remove scalar_beta1 and muon cooldown code (both hurt or neutral). Add WD results table to README. Tighten SWA threshold to scale<0.2 (matching PR openai#374). Disable Late QAT (was dead code). Add submission template.

Fork the fastest proven config (PR openai#374, 86ms/step, 1.1246 BPP) and add full-weight TTT (SGD, 3 epochs, freeze blocks 0-1). Predicted: ~1.122. Base: Tight SWA + Shared VE128 + XSA4 + Partial RoPE + LN Scale + Late QAT Added: TTT (ttt_adapt function, 6 hyperparams, inserted after dequant) Trimmed from 1676 to 1473 lines (comments/docstrings removed) Code-reviewed: TTT insertion point correct, RoPE verified, Late QAT present. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@unnir

Beats SOTA (1.1428) by 0.0129 nats across 3 seeds (1337, 7, 99). Built on PR openai#374 by @unnir with added test-time training. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@unnir

Beats SOTA (1.1428) by 0.0129 nats across 3 seeds (1337, 7, 99). Built on PR openai#374 by @unnir with added test-time training.

original_model.md: - Discard depth recurrence (amplifies quant error 900×, throughput loss) - New direction: eval-time optimization stack (PPM-C + GPTQ-lite) - Document all our experiment results (v3, v4, v4_30m, ringgolf) - Add TTT/XSA interaction findings (PR openai#303: mutually exclusive) - Add PR openai#375 meta-insight (1ms overhead = 0.006 BPB) - 4-phase execution plan targeting PPM-C as original contribution review_pr_records_track_10min_16mb.md: - Add 2026-03-22 update with PRs openai#374, openai#379, openai#390, openai#375, openai#303, openai#363 - New SOTA at 1.1246 (PR openai#374: Tight SWA + VE128) - Document negative results from $500 compute spend (PR openai#375) - Unexplored opportunities: PPM-C, Neural Cache review_records_track_10min_16mb.md: - Add timestamp note (17 records, no changes) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Full SOTA reproduction stack with novel additions: Architecture: - Partial RoPE (16/64 dims) — position-free attention on 75% of dims - LN Scale (1/sqrt(layer+1)) — damp deeper layers - XSA on last 4 layers — GQA-aware orthogonal self-value debiasing - Shared Value Embedding (dim=128, layers 9,10) — 1 table, per-layer scales - SmearGate, BigramHash (existing) Training: - Tight SWA (scale<0.2) — only average last ~600 steps, zero penalty - Late QAT (existing) - Muon WD=0.038, logit softcap=30 Post-training: - GPTQ-lite: per-tensor clip ratio search (5 candidates) minimizing reconstruction error. Zero training cost. Eval-time (NOVEL): - PPM-C context mixer: order-2 per-document n-gram model mixed with neural log-probs at alpha=0.95. Zero artifact cost, ~60 LOC. 1325 lines (under 1500 cap). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

From clean upstream base, added: - Hyperparams: 11L, MLP3x, seq2048, batch786K, Muon 0.99, WD=0.04, Partial RoPE, LN Scale, XSA, VE, Tight SWA, Late QAT, GPTQ-lite - Modules: SmearGate, SharedValueEmbedding, fake_quantize, CastedLinear+QAT - Partial RoPE in Rotary + apply_rotary_emb TODO: CausalSelfAttention (XSA+VE), Block (LN Scale), GPT (wire all), Muon WD, training loop (SWA, Late QAT, EMA), quantization (int6, GPTQ) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Clean rewrite from upstream base with full SOTA stack: Architecture: 11L, MLP3x, SmearGate, SharedVE128 (layers 9,10), Partial RoPE (16/64 dims), LN Scale (1/sqrt(i+1)), XSA4 (GQA-aware), U-Net skips, logit softcap 30, tied embeddings. Training: Muon lr=0.025 momentum=0.99 WD=0.04, batch 786K, seq 2048, warmdown 3000, grad_clip 0.3. Late QAT (STE int6 when lr_scale<0.1). Tight SWA (scale<0.2, every 50 steps, uniform average). Quantization: GPTQ-lite (5-point clip search per tensor), int6 step=4 on middle layers (3-7), FP16 embedding passthrough. GPT class simplified to take Hyperparameters directly. 1172 lines (under 1500 cap). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Submission combining PR openai#374 frontier techniques with MLP width optimization and GPTQ-lite clip search: - 11L/512d, MLP hidden=1408, 25.2M params - Partial RoPE (16/64), LN Scale, XSA4, Shared VE128 - Tight SWA (scale<0.2), Late QAT (lr_scale<0.1) - GPTQ-lite per-tensor clip search (5 candidates) - Int6 layers 1-9 + int8 layers 0,10 + FP16 embed - zstd-22 compression → 15.95MB artifact - 4071 steps @ 137ms/step on 8×H100 SXM val_bpb: 1.1804 (single seed 1337) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

cocohearts · 2026-03-23T17:25:28Z

unfortunately not statistically significant win over previous, sorry

- Partial RoPE: apply RoPE to first N dims of head_dim (ROPE_DIMS env var) - LN Scale: multiply sublayer inputs by 1/sqrt(layer+1) (LN_SCALE env var) - Both from top competition records (PR openai#374, openai#414) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- XSA: subtract self-value projection in attention output (from PR openai#374) Configurable via XSA_LAST_N env var (apply to last N layers) - OrthoInit: orthogonal weight initialization for all linear layers - Both from top competition records (~1.12 bpb) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- SmearGate: learned per-dim gate mixing current token with predecessor - Applied before RMSNorm in embedding layer - From top competition records (PR openai#374) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Record: 11L + Tight SWA + Shared VE128 + Partial RoPE + LN Scale + XS…

1323a3a

…A4 (val_bpb: 1.1246)

notapplica mentioned this pull request Mar 21, 2026

⛳ Parameter Golf Live AI Commentary ⛳ + Analysis / Ideas | every 10 minutes #140

Open

felipe-parodi added a commit to felipe-parodi/parameter-golf that referenced this pull request Mar 21, 2026

feat: add SWA_THRESHOLD for tight SWA (PR openai#374 technique)

1f02387

joelnishanth added a commit to joelnishanth/parameter-golf that referenced this pull request Mar 21, 2026

Update to PR openai#374 base (1.1246 BPB) new leader with FA3 fallback

96604e9

Made-with: Cursor

dannywillowliu-uchi mentioned this pull request Mar 22, 2026

Record: 11L GPTQ-lite + Int6 MLP3x (val_bpb=1.1257) #379

Open

newjordan mentioned this pull request Mar 22, 2026

Record: 11L + EMA + Tight SWA + QAT0.15 + VE128 + Partial RoPE + LN Scale (val_bpb: 1.1243) #401

Closed

EthanYangTW mentioned this pull request Mar 22, 2026

Record: 11L XSA4 + Tight SWA + FA3 + Two-Phase TTT (val_bpb=1.1216) #410

Closed

signalrush mentioned this pull request Mar 22, 2026

Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15 (val_bpb=1.1233) #414

Merged

6 tasks

This was referenced Mar 22, 2026

Record: 11L XSA4 + Tight SWA + FA3 + Two-Phase TTT (val_bpb=1.1216) #415

Closed

Record: 11L XSA4 + Tight SWA + FA3 + Two-Phase TTT (3-seed mean val_bpb=1.1227) #417

Closed

abaybektursun mentioned this pull request Mar 22, 2026

Record: Parallel Muon + Parameter Banking — 81.87ms/step, val_bpb 1.1247 (3-seed mean) #399

Open

kasimte mentioned this pull request Mar 22, 2026

Record: 11L Tight SWA + VE128 + XSA4 + TTT (3-seed mean val_bpb=1.1299) #455

Open

5 tasks

This was referenced Mar 23, 2026

Record: GPTQ + Legal TTT (3-seed mean val_bpb=1.1195) #503

Closed

Record: GPTQ + Legal TTT (3-seed mean val_bpb=1.1195) #528

Closed

Record: GPTQ + Legal TTT (3-seed mean val_bpb=1.1195) #529

Closed

rarce mentioned this pull request Mar 23, 2026

Non-record: 11L Partial RoPE + XSA4 + VE128 + Tight SWA + GPTQ-lite (val_bpb=1.1804) #534

Closed

6 tasks

Christopher-Lee-McClendon mentioned this pull request Mar 23, 2026

Non-Record: BPB 1.13872 — LeakyReLU(0.5)² + Per-Layer LR Legal TTT (3 seeds) #537

Open

Christopher-Lee-McClendon mentioned this pull request Mar 23, 2026

Non-record: 11L + 30-Epoch Legal TTT (BPB 1.14252) #526

Open

rarce mentioned this pull request Mar 23, 2026

Non-record: 11L Partial RoPE + XSA4 + VE128 + Tight SWA + GPTQ-lite (val_bpb=1.1804) #543

Open

6 tasks

This was referenced Mar 23, 2026

Non-record: 11L Depth Recurrence + High-Yield Legal TTT (1.14458 BPB) #461

Open

Non-record submission: Depth Recurrence + Legal Score-First TTT (10L, 1.1532 BPB) #456

Open

valerio-oai mentioned this pull request Mar 24, 2026

Record: Legal Score-First TTT + Parallel Muon — val_bpb 1.1214 (3-seed mean) #473

Closed

Christopher-Lee-McClendon mentioned this pull request Mar 24, 2026

Non-Record: BPB 1.1334 — 7000-Step Training + Mixed Int6/Int8 Quantization + Legal TTT #598

Open

0hq mentioned this pull request Mar 25, 2026

Illegal submissions megathread #677

Open

abaybektursun mentioned this pull request Mar 25, 2026

Record: AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112 — val_bpb 1.11473 (3-seed mean) #728

Closed

aruniyer mentioned this pull request Mar 27, 2026

Non-record: 15L Depth Recurrence + LeakyReLU² — BI-guided weight tying (val_bpb=1.1360) #857

Draft

abaybektursun mentioned this pull request Mar 28, 2026

Record: AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112 — val_bpb 1.11473 (3-seed mean) #1019

Merged

Gusanidas mentioned this pull request Mar 30, 2026

Record: 1.1140 BPB — ResidLambdas + Split-LR + Train-Budget GPTQ + Coprime Loader (12-seed mean) #1130

Open

This was referenced Mar 31, 2026

Non-record: 10L E2E TTT-Linear + FlowRefiner — val_bpb 1.1335 ± 0.0010 (4-seed mean) #1166

Open

Non-record: 11L NativeFlowMatcher + Legal TTT — val_bpb 1.1199 (3-seed mean no-TTT: 1.1225) #1170

Open

Gusanidas mentioned this pull request Apr 1, 2026

Record: Window Attention + Mixed Seq_Len Training, bpb 1.1108, eval at 6144 (5-seed mean) #1212

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record: 11L + Tight SWA + Shared VE128 + Partial RoPE + LN Scale + XSA4 (val_bpb: 1.1246)#374

Record: 11L + Tight SWA + Shared VE128 + Partial RoPE + LN Scale + XSA4 (val_bpb: 1.1246)#374
unnir wants to merge 1 commit intoopenai:mainfrom
unnir:unnir/v38-tight-swa-record

unnir commented Mar 21, 2026

Uh oh!

cocohearts commented Mar 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

unnir commented Mar 21, 2026

Record: 11L + Tight SWA + Shared VE128 + Partial RoPE + LN Scale + XSA4 (val_bpb: 1.1246)

Key Innovation: Tight SWA

Architecture

Training

Quantization

Results

Run

Uh oh!

cocohearts commented Mar 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants