Skip to content

Record: 11L XSA4 + EMA + TTT + Int6 MLP3x (val_bpb=1.1442)#317

Open
chris-buckley wants to merge 4 commits intoopenai:mainfrom
chris-buckley:record/11L-XSA4-EMA-TTT-Int6-MLP3x
Open

Record: 11L XSA4 + EMA + TTT + Int6 MLP3x (val_bpb=1.1442)#317
chris-buckley wants to merge 4 commits intoopenai:mainfrom
chris-buckley:record/11L-XSA4-EMA-TTT-Int6-MLP3x

Conversation

@chris-buckley
Copy link
Copy Markdown

@chris-buckley chris-buckley commented Mar 21, 2026

This takes the current best public training stack and makes one bet on top of it: full-model SGD test-time training at eval.

The training recipe is the 11L SmearGate/BigramHash/XSA4/EMA/int6+zstd-22 setup that's been winning. I kept it intact and added TTT as an orthogonal eval-time pass. It costs zero training budget. The SGD pass (3 epochs, lr=0.002, momentum=0.9, first 2 blocks frozen) runs on the dequantized checkpoint before scoring and takes about 50-80s.

What's different from Compression-Funded MLP3x

  • XSA on the last 4 layers
  • EMA instead of SWA
  • TTT at eval time (the big one)
  • Small LR bumps: matrix_lr 0.02 to 0.025, scalar_lr 0.02 to 0.025, tied_embed_lr 0.03 to 0.035
  • eval_stride 256 to 64

Results

Seed Post-TTT val_bpb Steps ms/step Artifact bytes
1337 1.1419 5,344 109.2 15,578,775
1338 1.1464 4,559 131.6 15,661,221
Mean 1.1442

Both artifacts under 16 MB. Both seeds beat the prior best single seed of 1.1598. Mean beats the prior mean of 1.1647 by 0.0205.

8xH100 SXM on community cloud. The two seeds ran on different nodes, which is why the step times differ. SDPA fallback since neither node had FA3 installed.

Training stack

11 layers, 512 dim, 8 heads / 4 KV heads, MLP 3x, SmearGate, BigramHash(2048), OrthoInit, muP-style output scaling, Muon/AdamW with WD=0.04, XSA on last 4 layers, EMA, int6 mixed quant + zstd-22.

Compatibility

The script falls back from FA3 to PyTorch SDPA automatically. I had to add manual KV head repeat for GQA since PyTorch 2.4 doesn't have enable_gqa, and clear the RoPE cache before TTT to avoid an inference tensor error during backward. It needs the zstandard package for zstd-22 compression (zlib puts the artifact over 16 MB).

Best public training stack (11L, MLP3x, SmearGate, BigramHash, XSA4, EMA,
int6+zstd-22) plus full-model SGD TTT at eval time. Seed 1337 on 8xH100 SXM.
Seed 1337: 1.1419, Seed 1338: 1.1464, Seed 1339: 1.1543.
All 3 seeds beat prior SOTA (1.1598). Mean delta vs prior: -0.0172.
Note: Seed 1339 artifact 222KB over 16MB limit (slower node).
Drop seed 1339 (artifact over 16MB limit). Seeds 1337+1338 both
under limit with mean val_bpb=1.1442, beating prior SOTA by 0.0205.
@chris-buckley chris-buckley changed the title Record: 11L XSA4 + EMA + TTT + Int6 MLP3x (val_bpb=1.1419) Record: 11L XSA4 + EMA + TTT + Int6 MLP3x (val_bpb=1.1442) Mar 21, 2026
@cocohearts
Copy link
Copy Markdown
Collaborator

ah nice try, test time training unfortunately is not in "the spirit of the challenge"

@cocohearts cocohearts added the invalid This doesn't seem right label Mar 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

invalid This doesn't seem right

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants