11L Int5-MLP + TTT-SGD + SmearGate + SWA (1.1455 BPB)#264

Open
stukenov wants to merge 1 commit into openai:main from stukenov:submission/11L-int5-ttt

Conversation

@stukenov

Summary

  • val_bpb: 1.1455 (seed 1337, single seed — 3-seed validation in progress)
  • Artifact: 15.94 MB (int5-MLP + int6-attn + zstd-22)

Techniques

| Technique | Source | Impact |
| --- | --- | --- |
| 11 layers (vs. 9-layer baseline) | Funded by int5 savings | More model capacity |
| Int5 MLP [-16, 15] + int6 attention [-32, 31] | Inspired by #180 | Saves ~1.9 MB, funds the 11th layer |
| Full-model SGD TTT (2 epochs) | Inspired by #152 | ~0.005 BPB at eval time |
| SmearGate + BigramHash | From #102/#135 | Bigram context injection |
| SWA (30 checkpoints) | From #89 | Better generalization |
| OrthoInit + muP scaling | From #162 | Stable training |
| Muon WD=0.04 | From #60 | Quantization-friendly weights |
| Sliding-window eval, stride=64 | From #50 | Full context per token |
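For readers unfamiliar with the mixed int5/int6 scheme, here is a minimal NumPy sketch of symmetric fake-quantization into the [-16, 15] and [-32, 31] ranges the table describes. This is not the submission's code; names and the per-tensor scaling choice are illustrative assumptions.

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, qmin: int, qmax: int):
    """Symmetric fake-quantization of a weight tensor to an integer range.

    Returns the integer codes and the scale needed to dequantize.
    Per-tensor scaling is an assumption; the PR may use per-channel scales.
    """
    scale = np.abs(w).max() / max(abs(qmin), qmax)  # map largest |weight| onto the range
    q = np.clip(np.round(w / scale), qmin, qmax).astype(np.int8)
    return q, scale

# Int5 for MLP weights, int6 for attention weights, matching the PR's split.
w_mlp = np.random.randn(512, 1536).astype(np.float32)
q_mlp, s_mlp = quantize_symmetric(w_mlp, -16, 15)     # 5-bit range
w_attn = np.random.randn(512, 512).astype(np.float32)
q_attn, s_attn = quantize_symmetric(w_attn, -32, 31)  # 6-bit range

w_deq = q_mlp.astype(np.float32) * s_mlp  # dequantized weights used at eval
```

Storing 5-bit codes instead of 6-bit ones for the (much larger) MLP matrices is where the ~1.9 MB saving comes from.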

Architecture

11L / 512d / 8h / 4kv (GQA) / MLP 3x / relu^2 / 2048 seq_len / 26.67M params
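Spelled out as a config, the shorthand above reads as follows. This is a sketch; the field names are illustrative, not the repo's actual hyperparameter names.

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # Field names are illustrative; the repo's config may differ.
    n_layers: int = 11
    d_model: int = 512
    n_heads: int = 8
    n_kv_heads: int = 4       # GQA: 8 query heads share 4 KV heads
    mlp_ratio: int = 3        # MLP hidden size = 3 * d_model = 1536
    activation: str = "relu^2"
    seq_len: int = 2048

cfg = ModelConfig()
assert cfg.n_heads % cfg.n_kv_heads == 0  # each KV head serves 2 query heads
```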

Results

| Stage | val_bpb |
| --- | --- |
| End of training (5197 steps) | 1.1583 |
| After int5/int6 quantization + sliding-window eval | 1.1507 |
| After TTT (2 epochs of SGD, lr=0.002) | 1.1455 |

Trained on 8xH100 SXM, 600s wallclock, 115ms/step.

Test plan

  • Single seed run (1337): 1.1455 BPB
  • 3-seed validation (1337, 42, 2025) — in progress
  • Artifact under 16MB: 15.99 MB total
  • Training under 10 min: 600s on 8xH100
  • Eval under 10 min: ~696s (TTT 422s + sliding eval 273s)
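The sliding-window eval in the plan above can be sketched as follows: windows advance by `stride` tokens, and only the freshly uncovered tokens of each window are scored, so every scored token sees at least `window - stride` tokens of left context. `logprob_fn` is a stand-in for the model, assumed to return per-token log2-probabilities; it is not the repo's API.

```python
import numpy as np

def sliding_window_bits(logprob_fn, tokens, window=2048, stride=64):
    """Mean bits per token under a sliding-window eval (stride=64 here).

    Only the last `stride` tokens of each window after the first are
    counted, trading extra forward passes for near-full context.
    """
    n = len(tokens)
    nll = np.empty(n)
    first = min(window, n)
    nll[:first] = -logprob_fn(tokens[:first])  # first window scores everything
    pos = first
    while pos < n:
        take = min(stride, n - pos)
        start = pos + stride - window          # keep the window at full length
        lp = logprob_fn(tokens[start : pos + take])
        nll[pos : pos + take] = -lp[-take:]    # count only the fresh tokens
        pos += take
    return nll.mean()
```

For a byte-level tokenizer, bits per token and bits per byte (BPB) coincide.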

11-layer model with mixed int5/int6 quantization, full-model SGD
test-time training, SmearGate, BigramHash, SWA, and OrthoInit.

Single seed result (1337): val_bpb=1.1455
Artifact: 15.94 MB (under 16MB limit)
3-seed validation in progress.
@mohosy

mohosy commented Mar 21, 2026

TTT with int5 MLP is a sick combo. How long does your TTT take during eval? Trying to figure out if 3 epochs is worth it over 2.

@stukenov
Author

@notapplica I need RunPod credits for the 3-seed validation.

HyperPotatoNeo added a commit to HyperPotatoNeo/parameter-golf that referenced this pull request Mar 21, 2026
Stacks XSA (PR openai#265), EMA weight averaging (PR openai#287), Int5-MLP (PR openai#264),
MuonWD=0.04 tuned from PR openai#162, seq_len=2048, 11 layers, BigramHash(2048),
SmearGate, OrthoInit (PR openai#135), Late-K FP16 on final layer.
Single-seed result (seed=1337), ~8903 steps on 8xH100.
ThomAub pushed a commit to ThomAub/parameter-golf that referenced this pull request Mar 22, 2026
Many TTT submissions (openai#136, openai#152, openai#254, openai#264, openai#338, openai#398, openai#417, openai#421, openai#442)
flagged as potentially invalid for adapting on eval tokens BEFORE scoring them.
Added correct score-then-adapt protocol with implementation guide.
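The flagged protocol issue above is about ordering: each eval chunk must be scored with the current weights before the model adapts on it. A minimal sketch of the score-then-adapt loop, assuming chunked `(input, target)` pairs and plain SGD (the thread's implementation guide may differ in detail):

```python
import torch

def score_then_adapt(model, chunks, lr=2e-3):
    """Score each chunk FIRST, then take a gradient step on it.

    Adapting before scoring leaks the chunk's own tokens into its score,
    which is what the flagged TTT submissions were doing.
    """
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    total_nll, total_tokens = 0.0, 0
    for x, y in chunks:  # x: input ids [1, T], y: targets [1, T]
        model.eval()
        with torch.no_grad():  # 1) score with weights not yet adapted on this chunk
            logits = model(x)
            total_nll += torch.nn.functional.cross_entropy(
                logits.flatten(0, 1), y.flatten(), reduction="sum"
            ).item()
            total_tokens += y.numel()
        model.train()          # 2) only now adapt on the chunk just scored
        loss = torch.nn.functional.cross_entropy(
            model(x).flatten(0, 1), y.flatten()
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
    return total_nll / total_tokens  # mean NLL in nats; divide by ln(2) for bits
```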
