Record: 11L + Tight SWA + Shared VE128 + Partial RoPE + LN Scale + XSA4 (val_bpb: 1.1246)#374
Open
unnir wants to merge 1 commit into openai:main from
Conversation
felipe-parodi added a commit to felipe-parodi/parameter-golf that referenced this pull request on Mar 21, 2026
joelnishanth added a commit to joelnishanth/parameter-golf that referenced this pull request on Mar 21, 2026
Made-with: Cursor
kasimte pushed a commit to kasimte/parameter-golf that referenced this pull request on Mar 22, 2026
Three low-risk additions:
- Memory Tokens (64 learnable embeddings, -0.014 A/B, PR openai#352)
- Backout Connection (learned mid-layer subtraction, -0.007, PR openai#339)
- Tight SWA (scale<0.2, every 50, replacing EMA; PR openai#374)

Bugs found and fixed during review:
- memory_tokens/backout_lambda not in optimizer groups (code review)
- memory_tokens appended to embed_params AFTER optimizer creation (/refine)
- Dead encoder-loop h_mid check removed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request on Mar 22, 2026
From arXiv:2603.09078. Projects out the self-value component from attention output, forcing the network to use contextual information. Applied via GQA-aware zero-alloc view reshape on the last 4 of 11 layers. Both top unmerged submissions (PR openai#374 at 1.1246 and PR openai#379 at 1.1260) use XSA as a key technique.

Full next-gen stack now includes: 11L, XSA, Partial RoPE 16/64, Late QAT STE, Tight SWA, GPTQ-lite, LN Scale, FA3, SmearGate, BigramHash, int6+zstd, Muon WD, OrthoInit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
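The self-value subtraction this commit describes can be sketched for a single head: after computing attention weights, remove each token's attention to its own value (the diagonal of the attention matrix) from the output. This is a minimal NumPy illustration of the idea, not the PR's GQA-aware zero-alloc implementation.

```python
import numpy as np

def xsa_attention(q, k, v):
    """Causal attention with the self-value component removed (XSA sketch).

    q, k, v: (T, d) for one head. After the softmax, the contribution
    a_tt * v_t of each token's own value is subtracted from the output,
    forcing the head to rely on context rather than on itself.
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # causal mask: token t may only attend to positions <= t
    scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    out = w @ v
    # subtract each token's attention to its own value: a_tt * v_t
    self_w = np.diag(w)[:, None]
    return out - self_w * v
```

One consequence visible in the sketch: the first token, which can only attend to itself, produces an exactly zero output once its self-value is removed.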
EthanYangTW added a commit to EthanYangTW/parameter-golf that referenced this pull request on Mar 22, 2026
Fork of unnir's openai#374 (1.1246 BPB) with TTT added:
- 11L, XSA4, Partial RoPE 16/64, LN Scale, Tight SWA
- Shared VE128, SmearGate, BigramHash 2048
- TTT: 25 epochs SGD on val data post-quantization
- Trimmed to 1476 lines (under 1500 limit)
EthanYangTW added a commit to EthanYangTW/parameter-golf that referenced this pull request on Mar 22, 2026
Two-phase TTT on PR openai#374 base: phase 1 norm-only recalibration (100ep Adam), phase 2 selective-freeze last 2 blocks (15ep SGD). Artifact 15.76MB.
filipviz added a commit to filipviz/parameter-golf that referenced this pull request on Mar 22, 2026
Remove scalar_beta1 and muon cooldown code (both hurt or neutral). Add WD results table to README. Tighten SWA threshold to scale<0.2 (matching PR openai#374). Disable Late QAT (was dead code). Add submission template.
This was referenced Mar 22, 2026
kasimte pushed a commit to kasimte/parameter-golf that referenced this pull request on Mar 22, 2026
Fork the fastest proven config (PR openai#374, 86ms/step, 1.1246 BPB) and add full-weight TTT (SGD, 3 epochs, freeze blocks 0-1). Predicted: ~1.122.

Base: Tight SWA + Shared VE128 + XSA4 + Partial RoPE + LN Scale + Late QAT
Added: TTT (ttt_adapt function, 6 hyperparams, inserted after dequant)
Trimmed from 1676 to 1473 lines (comments/docstrings removed)
Code-reviewed: TTT insertion point correct, RoPE verified, Late QAT present.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
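The full-weight TTT step this commit describes (SGD, a few epochs, earliest blocks frozen, inserted after dequantization) can be sketched as below. The commit names a `ttt_adapt` function; the signature, hyperparameter defaults, and the assumption that the model returns per-token logits are illustrative, not the PR's exact code.

```python
import torch

def ttt_adapt(model, val_tokens, epochs=3, lr=1e-4,
              freeze_prefix=("blocks.0", "blocks.1")):
    """Test-time training sketch: fine-tune on the eval stream itself.

    model: torch module mapping token ids (B, T) -> logits (B, T, V).
    val_tokens: (B, T) token ids from the evaluation data.
    Parameters whose names start with any prefix in freeze_prefix are
    frozen; the rest are updated with plain SGD on next-token loss.
    """
    for name, p in model.named_parameters():
        p.requires_grad = not name.startswith(freeze_prefix)
    opt = torch.optim.SGD(
        [p for p in model.parameters() if p.requires_grad], lr=lr)
    model.train()
    for _ in range(epochs):
        logits = model(val_tokens[:, :-1])
        loss = torch.nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            val_tokens[:, 1:].reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```

Freezing the earliest blocks limits drift in the features most shared across documents while still letting the later layers specialize to the evaluation distribution.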
kasimte pushed a commit to kasimte/parameter-golf that referenced this pull request on Mar 22, 2026
Beats SOTA (1.1428) by 0.0129 nats across 3 seeds (1337, 7, 99). Built on PR openai#374 by @unnir with added test-time training. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kasimte pushed a commit to kasimte/parameter-golf that referenced this pull request on Mar 22, 2026
Beats SOTA (1.1428) by 0.0129 nats across 3 seeds (1337, 7, 99). Built on PR openai#374 by @unnir with added test-time training.
rarce added a commit to rarce/parameter-golf that referenced this pull request on Mar 22, 2026
original_model.md:
- Discard depth recurrence (amplifies quant error 900×, throughput loss)
- New direction: eval-time optimization stack (PPM-C + GPTQ-lite)
- Document all our experiment results (v3, v4, v4_30m, ringgolf)
- Add TTT/XSA interaction findings (PR openai#303: mutually exclusive)
- Add PR openai#375 meta-insight (1ms overhead = 0.006 BPB)
- 4-phase execution plan targeting PPM-C as original contribution

review_pr_records_track_10min_16mb.md:
- Add 2026-03-22 update with PRs openai#374, openai#379, openai#390, openai#375, openai#303, openai#363
- New SOTA at 1.1246 (PR openai#374: Tight SWA + VE128)
- Document negative results from $500 compute spend (PR openai#375)
- Unexplored opportunities: PPM-C, Neural Cache

review_records_track_10min_16mb.md:
- Add timestamp note (17 records, no changes)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
rarce added a commit to rarce/parameter-golf that referenced this pull request on Mar 22, 2026
Full SOTA reproduction stack with novel additions:

Architecture:
- Partial RoPE (16/64 dims): position-free attention on 75% of dims
- LN Scale (1/sqrt(layer+1)): damp deeper layers
- XSA on last 4 layers: GQA-aware orthogonal self-value debiasing
- Shared Value Embedding (dim=128, layers 9,10): 1 table, per-layer scales
- SmearGate, BigramHash (existing)

Training:
- Tight SWA (scale<0.2): only average last ~600 steps, zero penalty
- Late QAT (existing)
- Muon WD=0.038, logit softcap=30

Post-training:
- GPTQ-lite: per-tensor clip ratio search (5 candidates) minimizing reconstruction error. Zero training cost.

Eval-time (NOVEL):
- PPM-C context mixer: order-2 per-document n-gram model mixed with neural log-probs at alpha=0.95. Zero artifact cost, ~60 LOC.

1325 lines (under 1500 cap).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
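The eval-time mixer this commit introduces (an order-2 per-document n-gram model blended with the neural model's probabilities at alpha=0.95) can be sketched as follows. The class name, the add-one smoothing, and the exact count structure are assumptions; real PPM-C uses a different escape mechanism, and the PR's ~60 LOC are not reproduced here.

```python
import math
from collections import defaultdict

class PPMMixer:
    """Eval-time order-2 n-gram mixer (sketch of the PPM-C idea above).

    Per-document counts of (t-2, t-1) -> next-token are kept, and the
    n-gram distribution is mixed with the neural model's probability:
        p = alpha * p_neural + (1 - alpha) * p_ngram
    Add-one smoothing stands in for PPM-C's escape mechanism.
    """
    def __init__(self, vocab_size, alpha=0.95):
        self.vocab_size = vocab_size
        self.alpha = alpha
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, tokens):
        """Update counts from tokens seen so far in the document."""
        for i in range(2, len(tokens)):
            self.counts[(tokens[i - 2], tokens[i - 1])][tokens[i]] += 1

    def mix_logprob(self, context, token, p_neural):
        """Log-probability of `token` under the mixed distribution."""
        ctx = self.counts.get(tuple(context[-2:]), {})
        total = sum(ctx.values())
        # add-one smoothing over the vocabulary for the n-gram component
        p_ngram = (ctx.get(token, 0) + 1) / (total + self.vocab_size)
        return math.log(self.alpha * p_neural + (1 - self.alpha) * p_ngram)
```

Because the mixer lives entirely at evaluation time and is reset per document, it adds nothing to the artifact size, which is the "zero artifact cost" claim above.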
rarce added a commit to rarce/parameter-golf that referenced this pull request on Mar 22, 2026
From clean upstream base, added:
- Hyperparams: 11L, MLP3x, seq2048, batch786K, Muon 0.99, WD=0.04, Partial RoPE, LN Scale, XSA, VE, Tight SWA, Late QAT, GPTQ-lite
- Modules: SmearGate, SharedValueEmbedding, fake_quantize, CastedLinear+QAT
- Partial RoPE in Rotary + apply_rotary_emb

TODO: CausalSelfAttention (XSA+VE), Block (LN Scale), GPT (wire all), Muon WD, training loop (SWA, Late QAT, EMA), quantization (int6, GPTQ)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
rarce added a commit to rarce/parameter-golf that referenced this pull request on Mar 22, 2026
Clean rewrite from upstream base with full SOTA stack:

Architecture: 11L, MLP3x, SmearGate, SharedVE128 (layers 9,10), Partial RoPE (16/64 dims), LN Scale (1/sqrt(i+1)), XSA4 (GQA-aware), U-Net skips, logit softcap 30, tied embeddings.

Training: Muon lr=0.025 momentum=0.99 WD=0.04, batch 786K, seq 2048, warmdown 3000, grad_clip 0.3. Late QAT (STE int6 when lr_scale<0.1). Tight SWA (scale<0.2, every 50 steps, uniform average).

Quantization: GPTQ-lite (5-point clip search per tensor), int6 step=4 on middle layers (3-7), FP16 embedding passthrough.

GPT class simplified to take Hyperparameters directly. 1172 lines (under 1500 cap).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
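The GPTQ-lite step mentioned in several commits here (a 5-point clip-ratio search per tensor, minimizing reconstruction error) can be sketched as below. The candidate ratios and the symmetric int6 scheme are assumptions; the commits only state that five candidates are searched per tensor.

```python
import numpy as np

def clip_search_quantize(w, bits=6, candidates=(1.0, 0.95, 0.9, 0.85, 0.8)):
    """Per-tensor clip-ratio search ("GPTQ-lite" sketch).

    For each candidate ratio r, symmetrically quantize w to `bits` bits
    with the scale set from r * max|w|, then keep the ratio whose
    reconstruction error is smallest. Clipping the outliers slightly
    shrinks the quantization step for everything else, which can lower
    the total error despite saturating a few extreme weights.
    """
    qmax = 2 ** (bits - 1) - 1
    best = None
    for r in candidates:
        scale = (np.abs(w).max() * r) / qmax
        q = np.clip(np.round(w / scale), -qmax - 1, qmax)
        err = np.square(w - q * scale).sum()
        if best is None or err < best[0]:
            best = (err, q.astype(np.int8), float(scale))
    return best[1], best[2]
```

Because the search only re-quantizes an already-trained tensor, it costs nothing at training time, matching the "zero training cost" note above.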
This was referenced Mar 23, 2026
rarce added a commit to rarce/parameter-golf that referenced this pull request on Mar 23, 2026
Submission combining PR openai#374 frontier techniques with MLP width optimization and GPTQ-lite clip search:
- 11L/512d, MLP hidden=1408, 25.2M params
- Partial RoPE (16/64), LN Scale, XSA4, Shared VE128
- Tight SWA (scale<0.2), Late QAT (lr_scale<0.1)
- GPTQ-lite per-tensor clip search (5 candidates)
- Int6 layers 1-9 + int8 layers 0,10 + FP16 embed
- zstd-22 compression → 15.95MB artifact
- 4071 steps @ 137ms/step on 8×H100 SXM

val_bpb: 1.1804 (single seed 1337)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator
unfortunately not statistically significant win over previous, sorry
demouo added a commit to demouo/parameter-golf that referenced this pull request on Apr 1, 2026
- Partial RoPE: apply RoPE to first N dims of head_dim (ROPE_DIMS env var)
- LN Scale: multiply sublayer inputs by 1/sqrt(layer+1) (LN_SCALE env var)
- Both from top competition records (PR openai#374, openai#414)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
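The two techniques in this commit are small enough to sketch directly. Below is a single-head NumPy illustration, assuming the rotate-by-halves RoPE layout; the PR applies 16 of 64 head dims, and LN Scale is just a per-layer constant.

```python
import numpy as np

def partial_rope(x, rope_dims=16, base=10000.0):
    """Apply rotary position embeddings to only the first `rope_dims`
    of head_dim; the remaining dims pass through position-free.

    x: (T, head_dim) for a single head. Rotate-by-halves layout is an
    assumption about the exact RoPE variant used.
    """
    T, _ = x.shape
    half = rope_dims // 2
    freqs = base ** (-np.arange(half) / half)        # (half,)
    ang = np.arange(T)[:, None] * freqs[None, :]     # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rope_dims]
    rotated = np.concatenate(
        [x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rope_dims:]], axis=-1)

def ln_scale(layer_index):
    """LN Scale: damp sublayer inputs of deeper layers by 1/sqrt(i+1)."""
    return 1.0 / np.sqrt(layer_index + 1)
```

Two sanity properties follow from the definitions: position 0 is rotated by angle zero (so it is unchanged), and dims beyond `rope_dims` are identical to the input at every position.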
demouo added a commit to demouo/parameter-golf that referenced this pull request on Apr 1, 2026
- XSA: subtract self-value projection in attention output (from PR openai#374). Configurable via XSA_LAST_N env var (apply to last N layers)
- OrthoInit: orthogonal weight initialization for all linear layers
- Both from top competition records (~1.12 bpb)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
demouo added a commit to demouo/parameter-golf that referenced this pull request on Apr 1, 2026
- SmearGate: learned per-dim gate mixing current token with predecessor
- Applied before RMSNorm in embedding layer
- From top competition records (PR openai#374)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
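The gate described in this commit can be sketched as a per-dimension sigmoid mix of each token's embedding with its predecessor's. Writing it as a convex mix is an assumption about the exact form; the commit only specifies a learned per-dim gate applied before RMSNorm.

```python
import numpy as np

def smear_gate(emb, gate_logits):
    """SmearGate sketch: mix each token's embedding with its predecessor's.

    emb: (T, D) token embeddings; gate_logits: (D,) learned parameters.
    gate = sigmoid(gate_logits) in (0, 1); the first token has no
    predecessor, so it mixes with zeros.
    """
    gate = 1.0 / (1.0 + np.exp(-gate_logits))
    prev = np.concatenate([np.zeros_like(emb[:1]), emb[:-1]], axis=0)
    return (1.0 - gate) * emb + gate * prev
```

With the logits initialized strongly negative, the gate starts near zero and the layer is an identity, which is the usual low-risk way to add such a mixing path.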
Record: 11L + Tight SWA + Shared VE128 + Partial RoPE + LN Scale + XSA4 (val_bpb: 1.1246)
Key Innovation: Tight SWA
SWA checkpoint collection restricted to scale<0.2 (last ~600 steps), every 50 steps. This eliminates the SWA quality penalty (post-SWA BPB = pre-SWA BPB) while maintaining quantization-friendly weight averaging. Standard SWA (scale<0.5) averages stale checkpoints that hurt final quality.
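The collection rule above (only average checkpoints taken every 50 steps once the LR schedule scale drops below 0.2) can be sketched with a running uniform average. The class and method names are illustrative; state dicts are assumed to hold float tensors supporting arithmetic, torch-style.

```python
class TightSWA:
    """Tight SWA sketch: uniform-average checkpoints only near the end.

    Checkpoints are collected every `every` steps, but only once the
    LR schedule scale has dropped below `threshold` (the last ~600
    steps in this PR's run), so no stale weights enter the average.
    """
    def __init__(self, threshold=0.2, every=50):
        self.threshold = threshold
        self.every = every
        self.avg = None   # running uniform average of collected states
        self.n = 0        # number of checkpoints collected

    def maybe_collect(self, step, lr_scale, state_dict):
        if lr_scale >= self.threshold or step % self.every != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = {k: v * 1.0 for k, v in state_dict.items()}
        else:
            for k, v in state_dict.items():
                # incremental uniform average: avg += (v - avg) / n
                self.avg[k] += (v - self.avg[k]) / self.n
```

Restricting collection this way is what makes post-SWA BPB match pre-SWA BPB: every averaged checkpoint is already close to the final weights, so the average stays on-distribution while still smoothing quantization noise.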
Architecture
Training
Quantization
Results
Run