
Record: 11L LeakyReLU² + VRL + lzma — val_bpb 1.1229 (3-seed mean) #175

Open

anthony-maio wants to merge 2 commits into openai:main from anthony-maio:submission/ttt-sota-graft

Conversation

anthony-maio commented Mar 20, 2026

Summary

val_bpb = 1.1229 (3-seed mean, std 0.0005) | ~15.89 MB | 8×H100 SXM

3-Seed Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | step_avg | steps | val_bpb | Artifact (bytes) |
|---|---|---|---|---|
| 1337 | 87.1 ms | 6,889 | 1.1234 | 15,887,926 |
| 42 | 88.0 ms | 6,818 | 1.1225 | 15,877,570 |
| 2025 | 87.5 ms | 6,857 | 1.1228 | 15,890,566 |
| **Mean** | 87.5 ms | 6,855 | 1.1229 (std 0.0005) | — |

All 3 artifacts under 16,000,000 bytes. All 3 train logs attached.

Key Innovations

LeakyReLU(0.5)²: One-line activation swap that preserves negative gradient flow through the MLP. ≈0.002 BPB improvement over standard relu². Credit: PR #493 @parinzee, PR #518 @sofiabod.
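As a minimal sketch of the swap (the function name is illustrative; the record's actual implementation lives in train_gpt.py): unlike relu², which zeroes negative pre-activations and their gradients, squaring a LeakyReLU output keeps a nonzero derivative on the negative side.

```python
def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU(slope) followed by squaring.

    For x < 0 the output is (slope * x)**2, whose derivative
    2 * slope**2 * x is nonzero, so gradient still flows through
    negative pre-activations (relu²'s derivative there is 0).
    """
    y = x if x > 0.0 else slope * x
    return y * y
```

In the PyTorch model this amounts to replacing `relu(x).square()` with `leaky_relu(x, 0.5).square()` in the MLP.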

Value Residual Learning (VRL): Layer 0's V output blended into all subsequent attention layers via learned sigmoid gates. Combats attention concentration (ResFormer, arXiv:2410.17897). +10 scalar params. Credit: PR #569 @gowtham0992.
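A sketch of the gating, assuming the ResFormer-style convex blend between layer 0's values and the current layer's values (the exact parameterization and names here are illustrative, not the PR's code; with one scalar gate per blended layer, layers 1-10 of an 11-layer model account for the +10 params):

```python
import math

def vrl_blend(v_layer, v0, gate_logit):
    """Value Residual Learning: mix layer 0's V output into a later
    layer's V through a learned per-layer sigmoid gate (one scalar)."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))  # sigmoid of the learned logit
    return [g * a + (1.0 - g) * b for a, b in zip(v0, v_layer)]
```

At gate_logit = 0 the blend is an even 50/50 mix; training can push each layer toward or away from the layer-0 values.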

lzma compression: Stdlib replacement for zstd-22 that compresses the quantized weights 2-5% tighter, recovering ~300-500 KB of headroom and fitting the full MLP 3× + BigramHash 2048 under 16 MB without capacity cuts. No external dependencies.
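To illustrate the stdlib swap (zstd is not in the stdlib, so zlib stands in as the comparison codec here, and the payload is synthetic 6-bit data standing in for int6-quantized weights):

```python
import lzma
import random
import zlib

# Synthetic stand-in for packed int6-quantized weights: 64 KiB of 6-bit values.
random.seed(0)
payload = bytes(random.randrange(64) for _ in range(1 << 16))

lz = lzma.compress(payload, preset=6)   # stdlib codec, preset matching the recipe
zl = zlib.compress(payload, 9)          # stand-in for an external codec

print(len(payload), len(zl), len(lz))
```

Decompression at load time is just `lzma.decompress(lz)`; no third-party wheel needs to ship with the artifact.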

Architecture

PR #414 base + LeakyReLU² + VRL + lzma:

| Component | Details |
|---|---|
| Layers | 11L, 512d, 8H/4KV (GQA), U-Net skips (5 enc, 6 dec) |
| MLP | 3× expansion (1536), LeakyReLU(0.5)² activation |
| Attention | XSA4, Partial RoPE 16/64, LN Scale 1/√(i+1), VRL |
| Embeddings | BigramHash(2048), VE128 (layers 9-10), SmearGate |
| Training | EMA(0.997) + Tight SWA, Late QAT (STE@0.15), OrthoInit |
| Optimizer | Muon WD=0.04, warmdown=3500, batch=786K tokens |
| Quantization | GPTQ-lite int6 + lzma (preset=6) |
| Attention kernel | FlashAttention 3 (Hopper native) |

Credits

Test plan

  • Seed 1337: 1.1234 bpb, 15.89MB valid
  • Seed 42: 1.1225 bpb, 15.88MB valid
  • Seed 2025: 1.1228 bpb, 15.89MB valid
  • 3-seed mean: 1.1229, std 0.0005
  • All 3 train logs attached
  • All artifacts under 16,000,000 bytes

…toneInit)

Combines the two strongest orthogonal improvements that haven't been stacked:
- SOTA training (1.1748): 10L, Muon WD, FP16 tied embed, spectral init, sliding window
- TTT eval (1.1928 on baseline): per-document LoRA adaptation, document isolation

Modifies CausalSelfAttention, Block, and GPT forward methods to accept
optional LoRA delta parameters while preserving default training behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 20, 2026 05:38
Contributor

Copilot AI left a comment


Pull request overview

Adds LoRA test-time training (TTT) evaluation on top of the existing SOTA training recipe in the records/track_10min_16mb track, aiming to report both standard sliding-window validation and per-document LoRA-adapted validation results.

Changes:

  • Extends the model forward path to accept optional batched LoRA deltas (Q/V + LM head) used only during evaluation.
  • Adds batched per-document LoRA adaptation and a document-isolated, chunked TTT evaluation routine.
  • Runs sliding-window eval after int8+zlib roundtrip, then runs TTT LoRA eval and logs both metrics.
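The optional LoRA delta described above can be sketched in plain Python (names, rank, and scaling are illustrative, not the PR's code): the base projection is unchanged when no adapter is passed, which is how the default training path stays intact.

```python
def matvec(M, x):
    """Dense matrix-vector product for small illustrative matrices."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def forward_with_lora(W, x, A=None, B=None, scale=1.0):
    """Base projection W @ x plus an optional low-rank LoRA delta
    scale * B @ (A @ x). With A/B absent, the output is exactly the
    base path, so training behavior is preserved by default."""
    y = matvec(W, x)
    if A is not None and B is not None:
        delta = matvec(B, matvec(A, x))      # rank-r update, r = rows of A
        y = [yi + scale * di for yi, di in zip(y, delta)]
    return y
```

During TTT eval, each document would get its own A/B pair, and batching the deltas amounts to carrying one such pair per sequence in the batch.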

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| records/track_10min_16mb/2026-03-20_TTT_SlidingWindow_FP16Emb_10L_MuonWD_OvertoneInit/train_gpt.py | Implements TTT LoRA adapters + evaluation and integrates it after the existing quantized sliding-window eval. |
| records/track_10min_16mb/2026-03-20_TTT_SlidingWindow_FP16Emb_10L_MuonWD_OvertoneInit/submission.json | Adds submission metadata for the new record entry. |
| records/track_10min_16mb/2026-03-20_TTT_SlidingWindow_FP16Emb_10L_MuonWD_OvertoneInit/README.md | Documents the combined recipe, expected results, and reproduction command. |


Replaces original TTT LoRA submission with current validated stack.
3-seed: 1.1234 / 1.1225 / 1.1228 = mean 1.1229 (std 0.0005)
All artifacts under 16MB. All logs attached.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
anthony-maio changed the title from "Record: TTT LoRA + SOTA Training (10L MuonWD FP16Emb OvertoneInit)" to "Record: 11L LeakyReLU² + VRL + lzma — val_bpb 1.1229 (3-seed mean)" on Mar 25, 2026