
Record Submission: 1.1078 BPB — XSA6 + BigramHash4K on Hedge Mixer Stack#720

Closed
agalimova wants to merge 1 commit into openai:main from agalimova:submission/xsa6-bigram4k-hedgemixer

Conversation

@agalimova

Summary

Changes from PR #700

| Parameter | Default | Ours |
| --- | --- | --- |
| `XSA_LAST_N` | 4 | 6 |
| `BIGRAM_VOCAB_SIZE` | 2048 | 4096 |
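The two changed settings can be sketched as a config fragment; the constant names mirror the table above, but how the actual repo wires them in is an assumption:

```python
# Hypothetical config fragment; real repo keys may differ.
XSA_LAST_N = 6            # default: 4 -- apply XSA to the last 6 layers
BIGRAM_VOCAB_SIZE = 4096  # default: 2048 -- size of the hashed bigram table
```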

Test plan

  • 3 seeds run on 8xH100 SXM (torch 2.9+cu126, FA3)
  • Mean improvement over merged SOTA (1.1194): -0.0116 BPB
  • All runs under 16MB artifact limit (15.3MB)
  • All runs under 600s training wallclock
  • Full training logs available (summaries included, full logs on request)

🤖 Generated with Claude Code

Built on PR openai#700 with hyperparameter improvements found via
autoresearch-multi combinatorial search:
- XSA_LAST_N=6 (extended from 4 to 6 layers)
- BIGRAM_VOCAB_SIZE=4096 (doubled from 2048)

3-seed mean: 1.1078 (std 0.0045)
Seeds: 42=1.1045, 1337=1.1061, 2025=1.1129

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Mar 27, 2026
Built on PR openai#720 by @agalimova. Novel TTT recipe:
- Per-layer LR groups (3x proj, 0.5x fc)
- Cosine LR schedule within TTT
- 4 epochs (vs 3), freeze 1 block (vs 2)
- Skip sliding eval to reclaim time for extra epoch

3-seed results:
  Seed 1337: 1.0726 BPB (537s eval)
  Seed   42: 1.0635 BPB (546s eval)
  Seed 2025: 1.0806 BPB (531s eval)
  Mean:      1.0722 ± 0.009

All seeds: train <600s, eval <600s, artifact <16MB.
Beats merged SOTA (1.1194) by 0.047.
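The per-layer LR groups and cosine schedule from the recipe above can be sketched in plain Python; the multipliers (3x proj, 0.5x fc) and base LR come from the commit message, while the function and key names are illustrative assumptions, not the repo's actual API:

```python
import math

def cosine_lr(step, total_steps, base_lr):
    """Cosine decay from base_lr down to 0 over the TTT run."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / total_steps))

# Per-layer multipliers from the recipe: 3x for projection layers,
# 0.5x for fully-connected layers, 1x otherwise. Names are hypothetical.
LAYER_LR_MULT = {"proj": 3.0, "fc": 0.5}

def lr_for(param_name, step, total_steps, base_lr=0.002):
    mult = next((m for key, m in LAYER_LR_MULT.items() if key in param_name), 1.0)
    return mult * cosine_lr(step, total_steps, base_lr)
```

In a PyTorch-style setup this would correspond to one optimizer param group per multiplier, with the scheduler scaling each group's LR by the same cosine factor.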
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Mar 27, 2026
Built on PR openai#720 by @agalimova. Key change: SGD TTT (lr=0.002,
momentum=0.9) replaces AdamW, producing -0.041 BPB improvement.

3-seed results:
  Seed 1337: 1.0312 BPB (540s eval)
  Seed   42: 1.0503 BPB (533s eval)
  Seed 2025: 1.0535 BPB (544s eval)
  Mean:      1.0450 ± 0.012

All seeds: train <600s, eval <600s, artifact <16MB.
Score-first legal TTT + backward-looking HedgeMixer.
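The optimizer swap above (AdamW replaced by SGD, lr=0.002, momentum=0.9) reduces to the classic momentum update; this is a minimal sketch of that rule, not the repo's training loop:

```python
def sgd_momentum_step(weights, grads, velocity, lr=0.002, momentum=0.9):
    """One SGD-with-momentum update: v <- m*v + g, w <- w - lr*v."""
    v_new = [momentum * v + g for v, g in zip(velocity, grads)]
    w_new = [w - lr * v for w, v in zip(weights, v_new)]
    return w_new, v_new
```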
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Mar 27, 2026
Built on PR openai#720 by @agalimova. Key improvement: momentum 0.95 (vs 0.9)
reduces variance and improves mean by 0.009 BPB.

3-seed results:
  Seed 1337: 1.0302 BPB (513s eval)
  Seed   42: 1.0365 BPB (533s eval)
  Seed 2025: 1.0419 BPB (539s eval)
  Mean:      1.0362 ± 0.006

Validated via comprehensive hyperparameter sweep:
  LR: 0.001/0.002/0.003 → 0.002 optimal
  Freeze: 0/1/2 → 0 optimal
  Epochs: 3/4/5 → 4 optimal
  Per-layer LR: 2x/3x/4x proj → 3x optimal
  Momentum: 0.9/0.95 → 0.95 optimal
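The sweep above is a full cross-product over the five axes listed; a minimal enumeration sketch (the grid values are from the commit message, the helper name is an assumption):

```python
from itertools import product

GRID = {
    "lr": [0.001, 0.002, 0.003],
    "freeze": [0, 1, 2],
    "epochs": [3, 4, 5],
    "proj_lr_mult": [2, 3, 4],
    "momentum": [0.9, 0.95],
}

def configs(grid):
    """Yield every combination in the grid as a dict."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

# 3 * 3 * 3 * 3 * 2 = 162 candidate configs in total.
```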
@valerio-oai
Contributor

Thanks for your submission! Unfortunately, it is disallowed because it uses hashed n-gram caches. These do not correctly renormalize or reweight the LM's token distribution, and they look ahead at the target token when mixing probabilities, which leaks eval tokens. Please see the long discussion about this under the Issues tab for more details, and please submit more runs in the future!
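The leak described above can be illustrated with a minimal sketch: a cache of this shape can only be queried by including the target token in the lookup key, and adding the resulting bonus outside the softmax breaks normalization. All names here are hypothetical, not the submission's actual code:

```python
def bigram_bonus(cache, prev_tok, target_tok, vocab_size=4096):
    # The lookup key includes the target token itself -- the scorer
    # must peek at the answer to form the key. That is the leak.
    key = (hash(prev_tok) % vocab_size, target_tok)
    return cache.get(key, 0.0)

def mixed_scores(p_lm, cache, prev_tok):
    # Bonus added per-target outside the softmax: the mixed "distribution"
    # no longer sums to 1, so BPB computed from it is not a valid bound.
    return {t: p + bigram_bonus(cache, prev_tok, t) for t, p in p_lm.items()}
```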

