
Record: 11L LeakyReLU² + VRL + lzma — val_bpb 1.1229 (3-seed mean) #175

Open

anthony-maio wants to merge 2 commits into openai:main from anthony-maio:submission/ttt-sota-graft

Conversation

anthony-maio commented Mar 20, 2026

Summary

val_bpb = 1.1229 (3-seed mean, std 0.0005) | ~15.89 MB | 8×H100 SXM

3-Seed Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | step_avg | steps | val_bpb | Artifact (bytes) |
|---|---|---|---|---|
| 1337 | 87.1 ms | 6,889 | 1.1234 | 15,887,926 |
| 42 | 88.0 ms | 6,818 | 1.1225 | 15,877,570 |
| 2025 | 87.5 ms | 6,857 | 1.1228 | 15,890,566 |
| **Mean** | 87.5 ms | 6,855 | 1.1229 (std 0.0005) | — |

All 3 artifacts under 16,000,000 bytes. All 3 train logs attached.

Key Innovations

LeakyReLU(0.5)²: One-line activation swap that preserves negative gradient flow through the MLP. ≈0.002 BPB improvement over standard relu². Credit: PR #493 @parinzee, PR #518 @sofiabod.
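As a minimal sketch of the swap (the function name is illustrative; the record's actual implementation lives in train_gpt.py): unlike relu², which zeroes negative pre-activations and their gradients, squaring a LeakyReLU output keeps a nonzero derivative on the negative side.

```python
def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU(slope) followed by squaring.

    For x < 0 the output is (slope * x)**2, whose derivative
    2 * slope**2 * x is nonzero, so gradient still flows through
    negative pre-activations (relu²'s derivative there is 0).
    """
    y = x if x > 0.0 else slope * x
    return y * y
```

In the PyTorch model this amounts to replacing `relu(x).square()` with `leaky_relu(x, 0.5).square()` in the MLP.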

Value Residual Learning (VRL): Layer 0's V output blended into all subsequent attention layers via learned sigmoid gates. Combats attention concentration (ResFormer, arXiv:2410.17897). +10 scalar params. Credit: PR #569 @gowtham0992.
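A sketch of the gating, assuming the ResFormer-style convex blend between layer 0's values and the current layer's values (the exact parameterization and names here are illustrative, not the PR's code; with one scalar gate per blended layer, layers 1-10 of an 11-layer model account for the +10 params):

```python
import math

def vrl_blend(v_layer, v0, gate_logit):
    """Value Residual Learning: mix layer 0's V output into a later
    layer's V through a learned per-layer sigmoid gate (one scalar)."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))  # sigmoid of the learned logit
    return [g * a + (1.0 - g) * b for a, b in zip(v0, v_layer)]
```

At gate_logit = 0 the blend is an even 50/50 mix; training can push each layer toward or away from the layer-0 values.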

lzma compression: Stdlib replacement for zstd-22 that compresses the quantized weights 2-5% tighter, recovering ~300-500 KB of headroom and fitting the full MLP 3× + BigramHash 2048 under 16 MB without capacity cuts. No external dependencies.
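To illustrate the stdlib swap (zstd is not in the stdlib, so zlib stands in as the comparison codec here, and the payload is synthetic 6-bit data standing in for int6-quantized weights):

```python
import lzma
import random
import zlib

# Synthetic stand-in for packed int6-quantized weights: 64 KiB of 6-bit values.
random.seed(0)
payload = bytes(random.randrange(64) for _ in range(1 << 16))

lz = lzma.compress(payload, preset=6)   # stdlib codec, preset matching the recipe
zl = zlib.compress(payload, 9)          # stand-in for an external codec

print(len(payload), len(zl), len(lz))
```

Decompression at load time is just `lzma.decompress(lz)`; no third-party wheel needs to ship with the artifact.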

Architecture

PR #414 base + LeakyReLU² + VRL + lzma:

| Component | Details |
|---|---|
| Layers | 11L, 512d, 8H/4KV (GQA), U-Net skips (5 enc, 6 dec) |
| MLP | 3× expansion (1536), LeakyReLU(0.5)² activation |
| Attention | XSA4, Partial RoPE 16/64, LN Scale 1/√(i+1), VRL |
| Embeddings | BigramHash(2048), VE128 (layers 9-10), SmearGate |
| Training | EMA(0.997) + Tight SWA, Late QAT (STE@0.15), OrthoInit |
| Optimizer | Muon WD=0.04, warmdown=3500, batch=786K tokens |
| Quantization | GPTQ-lite int6 + lzma (preset=6) |
| Attention kernel | FlashAttention 3 (Hopper native) |

Credits

Test plan

  • Seed 1337: 1.1234 bpb, 15.89MB valid
  • Seed 42: 1.1225 bpb, 15.88MB valid
  • Seed 2025: 1.1228 bpb, 15.89MB valid
  • 3-seed mean: 1.1229, std 0.0005
  • All 3 train logs attached
  • All artifacts under 16,000,000 bytes

…toneInit)

Combines the two strongest orthogonal improvements that haven't been stacked:
- SOTA training (1.1748): 10L, Muon WD, FP16 tied embed, spectral init, sliding window
- TTT eval (1.1928 on baseline): per-document LoRA adaptation, document isolation

Modifies CausalSelfAttention, Block, and GPT forward methods to accept
optional LoRA delta parameters while preserving default training behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 20, 2026 05:38
Contributor

Copilot AI left a comment


Pull request overview

Adds LoRA test-time training (TTT) evaluation on top of the existing SOTA training recipe in the records/track_10min_16mb track, aiming to report both standard sliding-window validation and per-document LoRA-adapted validation results.

Changes:

  • Extends the model forward path to accept optional batched LoRA deltas (Q/V + LM head) used only during evaluation.
  • Adds batched per-document LoRA adaptation and a document-isolated, chunked TTT evaluation routine.
  • Runs sliding-window eval after int8+zlib roundtrip, then runs TTT LoRA eval and logs both metrics.
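The optional LoRA delta described above can be sketched in plain Python (names, rank, and scaling are illustrative, not the PR's code): the base projection is unchanged when no adapter is passed, which is how the default training path stays intact.

```python
def matvec(M, x):
    """Dense matrix-vector product for small illustrative matrices."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def forward_with_lora(W, x, A=None, B=None, scale=1.0):
    """Base projection W @ x plus an optional low-rank LoRA delta
    scale * B @ (A @ x). With A/B absent, the output is exactly the
    base path, so training behavior is preserved by default."""
    y = matvec(W, x)
    if A is not None and B is not None:
        delta = matvec(B, matvec(A, x))      # rank-r update, r = rows of A
        y = [yi + scale * di for yi, di in zip(y, delta)]
    return y
```

During TTT eval, each document would get its own A/B pair, and batching the deltas amounts to carrying one such pair per sequence in the batch.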

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| records/track_10min_16mb/2026-03-20_TTT_SlidingWindow_FP16Emb_10L_MuonWD_OvertoneInit/train_gpt.py | Implements TTT LoRA adapters + evaluation and integrates it after the existing quantized sliding-window eval. |
| records/track_10min_16mb/2026-03-20_TTT_SlidingWindow_FP16Emb_10L_MuonWD_OvertoneInit/submission.json | Adds submission metadata for the new record entry. |
| records/track_10min_16mb/2026-03-20_TTT_SlidingWindow_FP16Emb_10L_MuonWD_OvertoneInit/README.md | Documents the combined recipe, expected results, and reproduction command. |


Replaces original TTT LoRA submission with current validated stack.
3-seed: 1.1234 / 1.1225 / 1.1228 = mean 1.1229 (std 0.0005)
All artifacts under 16MB. All logs attached.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
anthony-maio changed the title from "Record: TTT LoRA + SOTA Training (10L MuonWD FP16Emb OvertoneInit)" to "Record: 11L LeakyReLU² + VRL + lzma — val_bpb 1.1229 (3-seed mean)" on Mar 25, 2026