
Record: 11L + Tight SWA + VE128 + Partial RoPE + LN Scale + TTT (val_bpb: 1.1231)#388

Closed
ElliotSlusky wants to merge 1 commit into openai:main from ElliotSlusky:submission/tightswa-ve-ttt-1.1231

Conversation

@ElliotSlusky

New SOTA, beating the previous record of 1.1246.

Key techniques:

  • Tight SWA (scale < 0.2, every 50 steps, 16 checkpoints)
  • Shared Value Embeddings (dim=128, layers 9,10)
  • Partial RoPE (16/64 dims) + LN Scale
  • Test-Time Training (25-epoch full-weight SGD, lr=0.008)
  • cuDNN SDPA (1.18x faster than FA2 for GQA)
  • Int6+zstd quantization

Results: val_bpb=1.1231, 6839 steps, 87.7ms/step, 15.43MB artifact
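For readers unfamiliar with Partial RoPE, here is a minimal sketch of the idea from the technique list above: only the first 16 of each head's 64 dims receive rotary position embeddings, and the rest pass through unrotated. This uses the half-split ("rotate half") RoPE convention; the function name, tensor layout, and base frequency are assumptions, not the PR's actual code.

```python
import torch

def partial_rope(x: torch.Tensor, rope_dims: int = 16) -> torch.Tensor:
    """Apply RoPE to the first `rope_dims` dims of each head.

    x: (batch, seq, heads, head_dim); the trailing head_dim - rope_dims
    dims are returned unchanged.
    """
    seq_len = x.shape[1]
    x_rot, x_pass = x[..., :rope_dims], x[..., rope_dims:]

    # Standard RoPE frequency schedule, but only over the rotated slice.
    half = rope_dims // 2
    freqs = 1.0 / (10000 ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), freqs)
    cos = angles.cos()[None, :, None, :]  # (1, seq, 1, half)
    sin = angles.sin()[None, :, None, :]

    # Half-split rotation: (x1, x2) -> (x1*cos - x2*sin, x1*sin + x2*cos).
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)
```

Leaving 48 of 64 dims position-free gives the head content-only channels while still encoding relative position in the rotated slice.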

EthanYangTW added a commit to EthanYangTW/parameter-golf that referenced this pull request Mar 22, 2026
…le, EMA, Late QAT, TTT

Major rewrite targeting top-5 leaderboard:
- 11 layers (from 10), BigramHash reduced to 10240 to fit 16MB
- XSA (Exclusive Self-Attention) on last 4 layers
- Partial RoPE: 16/64 head dims get position encoding
- LN Scale: 1/sqrt(layer+1) dampening on deeper layers
- EMA (decay=0.997) replaces SWA
- Late QAT: STE int6 enabled only in final 4% of training
- TTT: 25-epoch SGD on val data post-quantization
- FA3 auto-detection with SDPA fallback
- Reverted SwiGLU back to relu² (SwiGLU confirmed worse in openai#340 and openai#344)
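The LN Scale item in the list above (1/sqrt(layer+1) dampening on deeper layers) can be sketched as a residual block whose update is scaled down with depth. The module structure and the stand-in MLP are assumptions; only the scaling rule comes from the commit message.

```python
import math
import torch
import torch.nn as nn

class DampedBlock(nn.Module):
    """Residual block whose update is damped by 1/sqrt(layer_index + 1)."""

    def __init__(self, dim: int, layer_index: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Linear(dim, dim)  # stand-in for the real sublayer
        self.scale = 1.0 / math.sqrt(layer_index + 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Deeper layers (larger layer_index) contribute smaller updates.
        return x + self.scale * self.mlp(self.norm(x))
```

With this rule, layer 0 contributes at full strength while layer 10 is damped to about 0.30, keeping the residual stream from growing with depth.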
felipe-parodi added a commit to felipe-parodi/parameter-golf that referenced this pull request Mar 22, 2026
