Non-record: Value Residual (-0.015 BPB) + Gated Attention (-0.003 BPB) on 11L Production Stack #487

Open
anantdgoel wants to merge 1 commit into openai:main from anantdgoel:non-record-vr-ga-production

val_bpb: 1.1720 | 19.4 MB (unlimited compute) | 1xA6000, 9500 steps, 14.5hr

Summary

  • Value Residual (ResFormer, arXiv:2410.17897): caches the layer-0 V vectors and mixes them into subsequent layers via learnable scalars. -0.015 BPB, 22 params added.
  • Gated Attention (arXiv:2505.06708): applies a per-head sigmoid gate after SDPA, which eliminates attention sinks. -0.003 BPB, ~37K params added.
  • Techniques stack nearly additively (-0.0172 combined vs. -0.0183 for the sum of the individual deltas), validated via controlled ablation on 9L baseline.
  • Full community meta-stack: 11L MLP3x + SmearGate + BigramHash(2048) + OrthoInit + WD0.04 + XSA(4) + EMA(0.997) + Partial RoPE + LN Scale + Logit Softcap.
  • Both techniques independently adopted by 5+ community submissions, including a record-tier entry (1.1101 BPB).

Ablation (9L v1024, 1000 steps, 131K batch, 1x3090)

| Config | val_bpb | Delta |
|---|---|---|
| Control | 1.4697 | |
| + Gated Attention | 1.4665 | -0.0032 |
| + Value Residual | 1.4546 | -0.0151 |
| + Both | 1.4525 | -0.0172 |
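
The deltas in the ablation can be re-derived from the reported val_bpb values with a few lines of Python. This is only a sanity check on the numbers listed here, and the variable names are illustrative.

```python
# val_bpb values copied from the ablation results above.
control = 1.4697
runs = {
    "gated_attention": 1.4665,
    "value_residual": 1.4546,
    "both": 1.4525,
}

# Delta of each configuration relative to the control run.
deltas = {name: round(bpb - control, 4) for name, bpb in runs.items()}

# Sum of the two single-technique deltas, for comparison with "both".
sum_individual = round(deltas["gated_attention"] + deltas["value_residual"], 4)

# Fraction of the summed individual gains retained when combined.
retained = deltas["both"] / sum_individual

for name, delta in deltas.items():
    print(f"{name:>16}: {delta:+.4f}")
print(f"{'sum of singles':>16}: {sum_individual:+.4f}")
print(f"{'retained':>16}: {retained:.1%}")
```

The combined run keeps roughly 94% of the summed individual improvements, so the two techniques overlap only slightly in this 1000-step setting.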

Production Results

| Metric | Value |
|---|---|
| Pre-quant val_bpb | 1.1710 |
| Post-quant val_bpb | 1.1720 |
| Quant gap | 0.0010 |
| Artifact | 19.4 MB |

Files

  • README.md — full writeup with ablations and reproducibility command
  • submission.json — metadata
  • train_gpt.py — training script
  • train.log — complete training log

…) on 11L production stack

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>