
[Non-Record] XSA-all-layers + VRL + bigram3072 + lzma9 — 1.1509 bpb, AdamW TTT findings#1045

Open
Hilo-Hilo wants to merge 1 commit into openai:main from Hilo-Hilo:submission/adamwttt-xsa11-vrl-bigram3072-lzma9

Conversation

@Hilo-Hilo

Summary

  • 1.15088552 bpb (sliding window stride=64, single seed) on 8×H100 SXM, 600s training
  • 15,316,405 bytes (15.3MB, under 16MB cap)
  • Key finding: AdamW TTT at LR=0.002 degrades to 1.2804 bpb — SGD is better for TTT at this LR
  • lzma9 measured compression ratio ~0.96 for int6 weights (not 0.85 as zlib-based estimates suggest)
  • XSA on all 11 layers + Value Residual Learning + bigram3072 stack documented
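The ~0.96 lzma ratio can be sanity-checked offline. The following is a minimal sketch, not the repo's packer: the weight distribution, int6 quantizer, and bit-packing layout are all assumptions made for illustration; only the `lzma` presets match the submission.

```python
import lzma
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trained weight tensor (the real artifact layout differs).
w = rng.standard_normal(1 << 16).astype(np.float32)

# Fake-quantize to int6: 64 uniform levels spanning roughly +/-4 std devs.
scale = 4 * w.std() / 31
q = np.clip(np.round(w / scale), -32, 31).astype(np.int8) + 32  # 0..63

# Bit-pack 6 bits per value (drop the top 2 bits of each uint8).
bits = np.unpackbits(q.astype(np.uint8)[:, None], axis=1)[:, 2:]
raw = np.packbits(bits).tobytes()

ratios = {}
for preset in (6, 9):
    ratios[preset] = len(lzma.compress(raw, preset=preset)) / len(raw)
    print(f"preset={preset}: ratio {ratios[preset]:.3f}")
```

Because 6-bit symbols straddle byte boundaries after packing, byte-oriented LZMA finds few repeats, which is consistent with a measured ratio much closer to 1.0 than zlib-based estimates on unpacked bytes would suggest.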

Approach

Systematic sweep of four axes on the 11L d512 architecture:

| Change from #414 stack | Delta bpb (approx) |
| --- | --- |
| XSA on all 11 layers (XSA_LAST_N=11) | −0.002 |
| Value Residual Learning (VALUE_RESIDUAL=1) | −0.001 |
| bigram3072 (3072-vocab bigram head, dim=112) | −0.001 |
| lzma preset=9 (vs preset=6) | 0.0 bpb, −200KB artifact |
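The bigram3072 row refers to a small auxiliary bigram head. A hypothetical sketch follows; the hash function, table initialization, and where the feature is injected into the network are assumptions for illustration, not the repo's implementation — only the 3072/112 sizes come from the config.

```python
import numpy as np

VOCAB, DIM = 3072, 112  # matches BIGRAM_VOCAB_SIZE / BIGRAM_DIM
rng = np.random.default_rng(0)
# Learned embedding table in the real model; randomly initialized here.
table = (rng.standard_normal((VOCAB, DIM)) * 0.02).astype(np.float32)

def bigram_embedding(prev_byte, byte):
    # Hash the (prev_byte, byte) pair into the 3072-entry table;
    # the multiplier 257 and the injection point are illustrative only.
    idx = (prev_byte * 257 + byte) % VOCAB
    return table[idx]

feat = bigram_embedding(104, 105)  # bytes of "hi"
print(feat.shape)
```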

Combined result without TTT: 1.1509 bpb

Legal TTT eval (AdamW, lr=0.002, 3ep): 1.2804 bpb — TTT at this LR degrades quality significantly.

AdamW TTT Finding

Replacing SGD with AdamW in the TTT adaptation loop at the same LR (0.002) caused a +0.13 bpb regression. The model without TTT (sliding window eval) scores 1.1509.

Possible causes:

  • AdamW's adaptive LRs interact poorly with per-document adaptation (optimizer state reset each doc)
  • LR=0.002 appropriate for SGD but too high for AdamW in this setting
  • SOTA TTT approaches use SGD with momentum tuned for TTT; AdamW is not a drop-in replacement

Recommendation: If using AdamW for TTT, use LR ~1e-4 to 1e-3 and reset optimizer state per-document.
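The LR-sensitivity point falls out of AdamW's update rule. A self-contained arithmetic sketch (illustrative, not the repo's TTT code): compare the first SGD and AdamW steps at lr=0.002 on a small gradient, as is typical late in adaptation.

```python
import math

lr, g = 0.002, 1e-4  # shared learning rate, small per-document gradient

# SGD: step is proportional to the gradient magnitude.
sgd_step = lr * g

# First bias-corrected AdamW step (weight decay omitted for clarity):
# m_hat / sqrt(v_hat) normalizes away |g|, so the step is ~lr regardless.
beta1, beta2, eps = 0.9, 0.999, 1e-8
m_hat = ((1 - beta1) * g) / (1 - beta1)          # == g
v_hat = ((1 - beta2) * g * g) / (1 - beta2)      # == g**2
adamw_step = lr * m_hat / (math.sqrt(v_hat) + eps)

print(f"SGD step: {sgd_step:.1e}, AdamW step: {adamw_step:.1e}, "
      f"ratio: {adamw_step / sgd_step:.0f}x")
```

With fresh (reset) optimizer state every document, AdamW's early steps are effectively lr-sized no matter how small the gradients are, which is one plausible mechanism for the +0.13 bpb regression at an LR tuned for SGD.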

Architecture

```
NUM_LAYERS=11
MODEL_DIM=512
XSA_LAST_N=11       # Cross-attention on ALL 11 layers
VALUE_RESIDUAL=1    # V = V + residual_V (value gating)
BIGRAM_VOCAB_SIZE=3072
BIGRAM_DIM=112
QAT_ENABLED=1       # Full-training fake-quant (STE int6)
```
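For reference, the value-residual mixing behind VALUE_RESIDUAL=1 can be sketched as below. This is a minimal numpy illustration of the V = V + residual_V idea; the lambda parameterization (learned vs fixed, per-layer vs scalar) is an assumption, not necessarily what the repo does.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 8, 512  # sequence length, model dim (D matches MODEL_DIM)

def mix_values(v_layer, v_first, lam=0.7):
    # Blend this layer's value projections with the first layer's values,
    # a common form of value residual learning; lam=0.7 is illustrative.
    return lam * v_layer + (1.0 - lam) * v_first

v_first = rng.standard_normal((T, D))  # values from layer 1
v_layer = rng.standard_normal((T, D))  # values from a later layer
v_mixed = mix_values(v_layer, v_first)
print(v_mixed.shape)
```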

Reproduction

```
NUM_LAYERS=11 MODEL_DIM=512 XSA_LAST_N=11 VALUE_RESIDUAL=1 \
BIGRAM_VOCAB_SIZE=3072 BIGRAM_DIM=112 \
QAT_ENABLED=1 SWA_ENABLED=0 TTT_ENABLED=0 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Note: set TTT_ENABLED=0 to reproduce the 1.1509 score (sliding window eval without TTT).

