
[Non-Record] XSA-all-layers + VRL + bigram3072 + lzma9 — 1.1509 bpb, AdamW TTT findings#1045

Open
Hilo-Hilo wants to merge 1 commit into openai:main from Hilo-Hilo:submission/adamwttt-xsa11-vrl-bigram3072-lzma9

Conversation

@Hilo-Hilo

Summary

  • 1.15088552 bpb (sliding window stride=64, single seed) on 8×H100 SXM, 600s training
  • 15,316,405 bytes (15.3MB, under 16MB cap)
  • Key finding: AdamW TTT at LR=0.002 degrades to 1.2804 bpb — SGD is better for TTT at this LR
  • lzma9 measured compression ratio ~0.96 for int6 weights (not 0.85 as zlib-based estimates suggest)
  • XSA on all 11 layers + Value Residual Learning + bigram3072 stack documented
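The ~0.96 lzma ratio can be sanity-checked offline. The following is a minimal sketch, not the repo's packer: the weight distribution, int6 quantizer, and bit-packing layout are all assumptions made for illustration; only the `lzma` presets match the submission.

```python
import lzma
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trained weight tensor (the real artifact layout differs).
w = rng.standard_normal(1 << 16).astype(np.float32)

# Fake-quantize to int6: 64 uniform levels spanning roughly +/-4 std devs.
scale = 4 * w.std() / 31
q = np.clip(np.round(w / scale), -32, 31).astype(np.int8) + 32  # 0..63

# Bit-pack 6 bits per value (drop the top 2 bits of each uint8).
bits = np.unpackbits(q.astype(np.uint8)[:, None], axis=1)[:, 2:]
raw = np.packbits(bits).tobytes()

ratios = {}
for preset in (6, 9):
    ratios[preset] = len(lzma.compress(raw, preset=preset)) / len(raw)
    print(f"preset={preset}: ratio {ratios[preset]:.3f}")
```

Because 6-bit symbols straddle byte boundaries after packing, byte-oriented LZMA finds few repeats, which is consistent with a measured ratio much closer to 1.0 than zlib-based estimates on unpacked bytes would suggest.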

Approach

Systematic sweep of four axes on the 11L d512 architecture:

| Change from #414 stack | Delta bpb (approx) |
| --- | --- |
| XSA on all 11 layers (XSA_LAST_N=11) | −0.002 |
| Value Residual Learning (VALUE_RESIDUAL=1) | −0.001 |
| bigram3072 (3072-vocab bigram head, dim=112) | −0.001 |
| lzma preset=9 (vs preset=6) | 0.0 bpb, −200KB artifact |
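The bigram3072 row refers to a small auxiliary bigram head. A hypothetical sketch follows; the hash function, table initialization, and where the feature is injected into the network are assumptions for illustration, not the repo's implementation — only the 3072/112 sizes come from the config.

```python
import numpy as np

VOCAB, DIM = 3072, 112  # matches BIGRAM_VOCAB_SIZE / BIGRAM_DIM
rng = np.random.default_rng(0)
# Learned embedding table in the real model; randomly initialized here.
table = (rng.standard_normal((VOCAB, DIM)) * 0.02).astype(np.float32)

def bigram_embedding(prev_byte, byte):
    # Hash the (prev_byte, byte) pair into the 3072-entry table;
    # the multiplier 257 and the injection point are illustrative only.
    idx = (prev_byte * 257 + byte) % VOCAB
    return table[idx]

feat = bigram_embedding(104, 105)  # bytes of "hi"
print(feat.shape)
```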

Combined result without TTT: 1.1509 bpb

Legal TTT eval (AdamW, lr=0.002, 3ep): 1.2804 bpb — TTT at this LR degrades quality significantly.

AdamW TTT Finding

Replacing SGD with AdamW in the TTT adaptation loop at the same LR (0.002) caused a +0.13 bpb regression. The model without TTT (sliding window eval) scores 1.1509.

Possible causes:

  • AdamW's adaptive LRs interact poorly with per-document adaptation (optimizer state reset each doc)
  • LR=0.002 appropriate for SGD but too high for AdamW in this setting
  • SOTA TTT approaches use SGD with momentum tuned for TTT; AdamW is not a drop-in replacement

Recommendation: If using AdamW for TTT, use LR ~1e-4 to 1e-3 and reset optimizer state per-document.
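The LR-sensitivity point falls out of AdamW's update rule. A self-contained arithmetic sketch (illustrative, not the repo's TTT code): compare the first SGD and AdamW steps at lr=0.002 on a small gradient, as is typical late in adaptation.

```python
import math

lr, g = 0.002, 1e-4  # shared learning rate, small per-document gradient

# SGD: step is proportional to the gradient magnitude.
sgd_step = lr * g

# First bias-corrected AdamW step (weight decay omitted for clarity):
# m_hat / sqrt(v_hat) normalizes away |g|, so the step is ~lr regardless.
beta1, beta2, eps = 0.9, 0.999, 1e-8
m_hat = ((1 - beta1) * g) / (1 - beta1)          # == g
v_hat = ((1 - beta2) * g * g) / (1 - beta2)      # == g**2
adamw_step = lr * m_hat / (math.sqrt(v_hat) + eps)

print(f"SGD step: {sgd_step:.1e}, AdamW step: {adamw_step:.1e}, "
      f"ratio: {adamw_step / sgd_step:.0f}x")
```

With fresh (reset) optimizer state every document, AdamW's early steps are effectively lr-sized no matter how small the gradients are, which is one plausible mechanism for the +0.13 bpb regression at an LR tuned for SGD.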

Architecture

```
NUM_LAYERS=11
MODEL_DIM=512
XSA_LAST_N=11       # Cross-attention on ALL 11 layers
VALUE_RESIDUAL=1    # V = V + residual_V (value gating)
BIGRAM_VOCAB_SIZE=3072
BIGRAM_DIM=112
QAT_ENABLED=1       # Full-training fake-quant (STE int6)
```
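For reference, the value-residual mixing behind VALUE_RESIDUAL=1 can be sketched as below. This is a minimal numpy illustration of the V = V + residual_V idea; the lambda parameterization (learned vs fixed, per-layer vs scalar) is an assumption, not necessarily what the repo does.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 8, 512  # sequence length, model dim (D matches MODEL_DIM)

def mix_values(v_layer, v_first, lam=0.7):
    # Blend this layer's value projections with the first layer's values,
    # a common form of value residual learning; lam=0.7 is illustrative.
    return lam * v_layer + (1.0 - lam) * v_first

v_first = rng.standard_normal((T, D))  # values from layer 1
v_layer = rng.standard_normal((T, D))  # values from a later layer
v_mixed = mix_values(v_layer, v_first)
print(v_mixed.shape)
```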

Reproduction

```
NUM_LAYERS=11 MODEL_DIM=512 XSA_LAST_N=11 VALUE_RESIDUAL=1 \
BIGRAM_VOCAB_SIZE=3072 BIGRAM_DIM=112 \
QAT_ENABLED=1 SWA_ENABLED=0 TTT_ENABLED=0 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Note: set TTT_ENABLED=0 to reproduce the 1.1509 score (sliding window eval without TTT).

