Skip to content

Record: 11L TrigramHash + ValueResidual + GradQuant + Cosine TTT (mean val_bpb=1.0887, best 1.0879)#486

Closed
ndokutovich wants to merge 3 commits intoopenai:mainfrom
ndokutovich:main
Closed

Record: 11L TrigramHash + ValueResidual + GradQuant + Cosine TTT (mean val_bpb=1.0887, best 1.0879)#486
ndokutovich wants to merge 3 commits intoopenai:mainfrom
ndokutovich:main

Conversation

@ndokutovich
Copy link
Copy Markdown

Record: 11L TrigramHash + ValueResidual + GradQuant + AdamW TTT

val_bpb = 1.1101 (sliding window stride=64, best seed 2024) | 15.34 MB artifact | 8xH100 SXM, 600s

Three Novel Contributions

Built on the PR #398/#442 baseline (11L EMA + AdamW TTT), this submission adds three techniques that improve quality-per-step:

  1. TrigramHash Embedding (4096 buckets, dim=128) — extends BigramHash to 3-token context via xor(36313*t[i], 27191*t[i-1], 51497*t[i-2]) % 4095. Added to token embedding before transformer blocks.

  2. Value Residual (ResFormer, arXiv:2410.17897) — cache V vectors from first attention layer, blend into all subsequent layers via learned scalars: v = lambda[0] * v0 + lambda[1] * v. Just 22 params total.

  3. Gradient-Guided Adaptive Quantization — accumulate per-tensor gradient sensitivity during last 10% of warmdown, then quantize adaptively: top 10% sensitivity → Int7, middle 70% → Int6, bottom 20% → Int5.

Results (3-seed, 8xH100 SXM)

Seed Steps Sliding BPB (s64) Artifact
42 5,190 1.1177 15.54 MB
1337 5,925 1.1118 15.76 MB
2024 5,930 1.1101 15.34 MB

Mean: 1.1132 | Std: 0.0040

Ablation

Config Steps BPB Delta
Baseline (PR #398 stack) 6,613 1.1403
+ TrigramHash + ValueResidual + GradQuant 5,190 1.1177 -0.023

Despite 22% fewer steps due to compute overhead, the three additions improve BPB by -0.023.

Run Command

SEED=2024 torchrun --standalone --nproc_per_node=8 train_gpt.py

All hyperparameters are defaults — no env vars needed.

@ndokutovich ndokutovich changed the title Record: 11L TrigramHash + ValueResidual + GradQuant + AdamW TTT (mean val_bpb=1.1132, best 1.1101) Record: 11L TrigramHash + ValueResidual + GradQuant + Cosine TTT (mean val_bpb=1.0887, best 1.0879) Mar 23, 2026
sahiee-dev added a commit to sahiee-dev/parameter-golf that referenced this pull request Mar 23, 2026
Full stack on thwu1 base (1.1428):
- Value Residual: lambda_v * v0 shortcut to every block, init=0
- Gated Attention: learned scalar gate on attn output, init=1
- XSA: orthogonal self-value removal, last 4 layers
- EMA: decay=0.9999 shadow model used at final eval
- AdamW TTT: lr=0.001, 3 epochs on val tokens before eval
- BigramHash(10240): restored to full size after ablation

Techniques consistent with PR openai#490 (1.0891) and PR openai#486 (1.0887).
Expected range: 1.08-1.10 on 8xH100s.
Trigram ablation confirmed negative at small scale — removed.
sofiabod added a commit to sofiabod/parameter-golf that referenced this pull request Mar 23, 2026
…enai#486)

- 30 epochs AdamW(lr=0.0005) on val tokens with cosine LR decay
- per-layer LR: 3x for mlp.proj (high quant error), 0.5x for mlp.fc
- DDP gradient sync via all_reduce(AVG) + grad clip 1.0
- keep LeakyReLU(0.5)^2 from exp48
- expected: ~0.06 BPB gain (1.127 → ~1.07)
- modal timeout 3600s for 30-epoch TTT
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 23, 2026
Key changes from studying PR openai#505 (1.1181) and openai#486 (1.0887):
- train_batch_tokens: 524K → 786K (all top entries use this)
- bigram_hash_buckets: 4096 → 8192 (PR openai#505 uses 8192, openai#493 uses 10240)
- grad_clip_norm: 0.3 → 0.0 (PR openai#505 disables clipping)
- Star-ReLU and TrigramHash enabled in all run scripts
sunnypatneedi added a commit to sunnypatneedi/parameter-golf that referenced this pull request Mar 23, 2026
Two-phase TTT pipeline (novel combination):
- Phase 1: In-Place TTT — updates MLP output projections per-document (ICLR 2026)
- Phase 2: Per-doc LoRA TTT — adapts Q/V/LM head with surprise gating (top-K tokens)

Architecture: PR openai#486 base (11L, TrigramHash, ValueResidual, GradQuant) +
LeakyReLU(0.5)^2 + eval-only XSA on all layers + Partial RoPE + LN Scale

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sunnypatneedi added a commit to sunnypatneedi/parameter-golf that referenced this pull request Mar 24, 2026
Run 1 showed:
- Pre-quant val_bpb: 1.1757
- Post-quant sliding window: 1.3569
- Quantization penalty: 0.18 bpb (expected ~0.003)

Root cause: Our GPTQ implementation (ported from PR openai#535) produced
WORSE quantization than standard per-row int6. PR openai#486 base doesn't
use GPTQ at all. Possible issues: bad Hessian calibration, numerical
instability in Cholesky decomposition, or name mismatch between
hooks and state dict keys.

Fix: Disable GPTQ, revert to standard quantization path.
GPTQ code preserved for future debugging.

Also confirmed: TTT bpb formula is algebraically correct.
The 0.6185 bpb was real (20 epochs = heavy per-doc overfitting).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sunnypatneedi added a commit to sunnypatneedi/parameter-golf that referenced this pull request Mar 24, 2026
In-Place TTT: loss INCREASES (2.63+), 955s+ eval time. Harmful.
GradQuant int5/int6 mix: 34KB over 16MB even without int7.
PR openai#486 baseline reproduced at 1.1249 (within seed variance of 1.1233).

Added lessons 13-16 to CLAUDE.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@valerio-oai
Copy link
Copy Markdown
Contributor

As far as I can tell here, this proposed TTT scheme trains on the validation set by reporting the score on a doc after its weights have adapted to it, rendering this unsound for the purposes of this competition.

sunnypatneedi added a commit to sunnypatneedi/parameter-golf that referenced this pull request Mar 24, 2026
Two-phase TTT pipeline (novel combination):
- Phase 1: In-Place TTT — updates MLP output projections per-document (ICLR 2026)
- Phase 2: Per-doc LoRA TTT — adapts Q/V/LM head with surprise gating (top-K tokens)

Architecture: PR openai#486 base (11L, TrigramHash, ValueResidual, GradQuant) +
LeakyReLU(0.5)^2 + eval-only XSA on all layers + Partial RoPE + LN Scale

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sunnypatneedi added a commit to sunnypatneedi/parameter-golf that referenced this pull request Mar 24, 2026
Run 1 showed:
- Pre-quant val_bpb: 1.1757
- Post-quant sliding window: 1.3569
- Quantization penalty: 0.18 bpb (expected ~0.003)

Root cause: Our GPTQ implementation (ported from PR openai#535) produced
WORSE quantization than standard per-row int6. PR openai#486 base doesn't
use GPTQ at all. Possible issues: bad Hessian calibration, numerical
instability in Cholesky decomposition, or name mismatch between
hooks and state dict keys.

Fix: Disable GPTQ, revert to standard quantization path.
GPTQ code preserved for future debugging.

Also confirmed: TTT bpb formula is algebraically correct.
The 0.6185 bpb was real (20 epochs = heavy per-doc overfitting).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sunnypatneedi added a commit to sunnypatneedi/parameter-golf that referenced this pull request Mar 24, 2026
In-Place TTT: loss INCREASES (2.63+), 955s+ eval time. Harmful.
GradQuant int5/int6 mix: 34KB over 16MB even without int7.
PR openai#486 baseline reproduced at 1.1249 (within seed variance of 1.1233).

Added lessons 13-16 to CLAUDE.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants