Record: 11L TrigramHash + ValueResidual + GradQuant + Cosine TTT (mean val_bpb=1.0887, best 1.0879)#486
Closed
ndokutovich wants to merge 3 commits intoopenai:mainfrom
Closed
Record: 11L TrigramHash + ValueResidual + GradQuant + Cosine TTT (mean val_bpb=1.0887, best 1.0879)#486ndokutovich wants to merge 3 commits intoopenai:mainfrom
ndokutovich wants to merge 3 commits intoopenai:mainfrom
Conversation
… val_bpb=1.1132, best 1.1101)
sahiee-dev
added a commit
to sahiee-dev/parameter-golf
that referenced
this pull request
Mar 23, 2026
Full stack on thwu1 base (1.1428): - Value Residual: lambda_v * v0 shortcut to every block, init=0 - Gated Attention: learned scalar gate on attn output, init=1 - XSA: orthogonal self-value removal, last 4 layers - EMA: decay=0.9999 shadow model used at final eval - AdamW TTT: lr=0.001, 3 epochs on val tokens before eval - BigramHash(10240): restored to full size after ablation Techniques consistent with PR openai#490 (1.0891) and PR openai#486 (1.0887). Expected range: 1.08-1.10 on 8xH100s. Trigram ablation confirmed negative at small scale — removed.
4 tasks
sofiabod
added a commit
to sofiabod/parameter-golf
that referenced
this pull request
Mar 23, 2026
…enai#486) - 30 epochs AdamW(lr=0.0005) on val tokens with cosine LR decay - per-layer LR: 3x for mlp.proj (high quant error), 0.5x for mlp.fc - DDP gradient sync via all_reduce(AVG) + grad clip 1.0 - keep LeakyReLU(0.5)^2 from exp48 - expected: ~0.06 BPB gain (1.127 → ~1.07) - modal timeout 3600s for 30-epoch TTT
mrdavtan
added a commit
to mrdavtan/parameter-golf
that referenced
this pull request
Mar 23, 2026
Key changes from studying PR openai#505 (1.1181) and openai#486 (1.0887): - train_batch_tokens: 524K → 786K (all top entries use this) - bigram_hash_buckets: 4096 → 8192 (PR openai#505 uses 8192, openai#493 uses 10240) - grad_clip_norm: 0.3 → 0.0 (PR openai#505 disables clipping) - Star-ReLU and TrigramHash enabled in all run scripts
sunnypatneedi
added a commit
to sunnypatneedi/parameter-golf
that referenced
this pull request
Mar 23, 2026
Two-phase TTT pipeline (novel combination): - Phase 1: In-Place TTT — updates MLP output projections per-document (ICLR 2026) - Phase 2: Per-doc LoRA TTT — adapts Q/V/LM head with surprise gating (top-K tokens) Architecture: PR openai#486 base (11L, TrigramHash, ValueResidual, GradQuant) + LeakyReLU(0.5)^2 + eval-only XSA on all layers + Partial RoPE + LN Scale Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sunnypatneedi
added a commit
to sunnypatneedi/parameter-golf
that referenced
this pull request
Mar 24, 2026
Run 1 showed: - Pre-quant val_bpb: 1.1757 - Post-quant sliding window: 1.3569 - Quantization penalty: 0.18 bpb (expected ~0.003) Root cause: Our GPTQ implementation (ported from PR openai#535) produced WORSE quantization than standard per-row int6. PR openai#486 base doesn't use GPTQ at all. Possible issues: bad Hessian calibration, numerical instability in Cholesky decomposition, or name mismatch between hooks and state dict keys. Fix: Disable GPTQ, revert to standard quantization path. GPTQ code preserved for future debugging. Also confirmed: TTT bpb formula is algebraically correct. The 0.6185 bpb was real (20 epochs = heavy per-doc overfitting). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sunnypatneedi
added a commit
to sunnypatneedi/parameter-golf
that referenced
this pull request
Mar 24, 2026
In-Place TTT: loss INCREASES (2.63+), 955s+ eval time. Harmful. GradQuant int5/int6 mix: 34KB over 16MB even without int7. PR openai#486 baseline reproduced at 1.1249 (within seed variance of 1.1233). Added lessons 13-16 to CLAUDE.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor
|
As far as I can tell here, this proposed TTT scheme trains on the validation set by reporting the score on a doc after its weights have adapted to it, rendering this unsound for the purposes of this competition. |
sunnypatneedi
added a commit
to sunnypatneedi/parameter-golf
that referenced
this pull request
Mar 24, 2026
Two-phase TTT pipeline (novel combination): - Phase 1: In-Place TTT — updates MLP output projections per-document (ICLR 2026) - Phase 2: Per-doc LoRA TTT — adapts Q/V/LM head with surprise gating (top-K tokens) Architecture: PR openai#486 base (11L, TrigramHash, ValueResidual, GradQuant) + LeakyReLU(0.5)^2 + eval-only XSA on all layers + Partial RoPE + LN Scale Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sunnypatneedi
added a commit
to sunnypatneedi/parameter-golf
that referenced
this pull request
Mar 24, 2026
Run 1 showed: - Pre-quant val_bpb: 1.1757 - Post-quant sliding window: 1.3569 - Quantization penalty: 0.18 bpb (expected ~0.003) Root cause: Our GPTQ implementation (ported from PR openai#535) produced WORSE quantization than standard per-row int6. PR openai#486 base doesn't use GPTQ at all. Possible issues: bad Hessian calibration, numerical instability in Cholesky decomposition, or name mismatch between hooks and state dict keys. Fix: Disable GPTQ, revert to standard quantization path. GPTQ code preserved for future debugging. Also confirmed: TTT bpb formula is algebraically correct. The 0.6185 bpb was real (20 epochs = heavy per-doc overfitting). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sunnypatneedi
added a commit
to sunnypatneedi/parameter-golf
that referenced
this pull request
Mar 24, 2026
In-Place TTT: loss INCREASES (2.63+), 955s+ eval time. Harmful. GradQuant int5/int6 mix: 34KB over 16MB even without int7. PR openai#486 baseline reproduced at 1.1249 (within seed variance of 1.1233). Added lessons 13-16 to CLAUDE.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Record: 11L TrigramHash + ValueResidual + GradQuant + AdamW TTT
val_bpb = 1.1101 (sliding window stride=64, best seed 2024) | 15.34 MB artifact | 8xH100 SXM, 600s
Three Novel Contributions
Built on the PR #398/#442 baseline (11L EMA + AdamW TTT), this submission adds three techniques that improve quality-per-step:
TrigramHash Embedding (4096 buckets, dim=128) — extends BigramHash to 3-token context via
xor(36313*t[i], 27191*t[i-1], 51497*t[i-2]) % 4095. Added to token embedding before transformer blocks.Value Residual (ResFormer, arXiv:2410.17897) — cache V vectors from first attention layer, blend into all subsequent layers via learned scalars:
v = lambda[0] * v0 + lambda[1] * v. Just 22 params total.Gradient-Guided Adaptive Quantization — accumulate per-tensor gradient sensitivity during last 10% of warmdown, then quantize adaptively: top 10% sensitivity → Int7, middle 70% → Int6, bottom 20% → Int5.
Results (3-seed, 8xH100 SXM)
Mean: 1.1132 | Std: 0.0040
Ablation
Despite 22% fewer steps due to compute overhead, the three additions improve BPB by -0.023.
Run Command
All hyperparameters are defaults — no env vars needed.