Legal TTT (SGD, 3-epoch) + SLOT (lr=0.003, steps=5) on PR #549 base -- val_bpb: 1.11512 (3-seed mean, beats merged SOTA 1.1194) #1150

Open

sahiee-dev wants to merge 1 commit into openai:main from sahiee-dev:clean-sota-record

Conversation

@sahiee-dev

Summary

Builds on the merged SOTA (PR #549, 2026-03-23_LeakyReLU_LegalTTT_ParallelMuon, 1.1194 bpb) by adding SLOT (Sample-specific LM Optimization at Test-time) as a complementary eval-time adaptation on top of the existing legal TTT.

Results

Seed   val_bpb      val_loss     Artifact (bytes)   Eval time
42     1.11514996   1.88287900   15,956,372         571.8s
1337   1.11501287   1.88264755   15,959,700         568.5s
2025   1.11520712   1.88297553   15,947,196         571.3s
mean   1.11512      1.88283      --                 --

All seeds: artifact < 16MB (pass), eval < 600s (pass), beats merged SOTA 1.1194 (pass)

Architecture

No changes to training. Identical to PR #549 base:

  • 11 layers, 512 dim, 8H/4KV GQA, Partial RoPE (rope_dims=16)
  • LeakyReLU(0.5)^2 MLP (mlp_mult=2.80)
  • BigramHash (vocab=1536, dim=128), VE128 at layers 9-10, XSA last 4 layers
  • LN Scale, DTG gate, U-Net skips, SmearGate
  • EMA(0.997) + SWA -> GPTQ-lite int6 + LZMA-6 -> ~15.95-15.96MB artifact
  • Parallel Muon + AdamW optimizer

Eval-Time Additions

Legal TTT (unchanged from PR #549)

  • SGD, 3 epochs per chunk, score first, no future token leakage
  • TTT_LR=0.002, TTT_CHUNK_TOKENS=32768, TTT_BATCH_SEQS=32
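The score-first ordering can be sketched on a toy model (this is a hypothetical stand-in, not the PR's implementation; the real model, chunking, and loss plumbing differ):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-in for one chunk of score-first TTT (SGD, 3 epochs, lr=0.002).
vocab, dim = 64, 16
model = torch.nn.Sequential(torch.nn.Embedding(vocab, dim),
                            torch.nn.Linear(dim, vocab))
opt = torch.optim.SGD(model.parameters(), lr=0.002)

def bpb(logits, targets):
    # bits-per-token proxy: mean cross-entropy converted from nats to bits
    return F.cross_entropy(logits, targets).item() / torch.log(torch.tensor(2.0)).item()

chunk = torch.randint(0, vocab, (33,))   # one tiny "chunk" of token ids
x, y = chunk[:-1], chunk[1:]

# 1) Score the chunk FIRST with frozen weights -- this is the reported number,
#    so no future token can influence its own probability.
with torch.no_grad():
    reported = bpb(model(x), y)

# 2) Only AFTER scoring, adapt on the already-scored tokens (3 SGD epochs).
for _ in range(3):
    opt.zero_grad()
    F.cross_entropy(model(x), y).backward()
    opt.step()
```

The key invariant is that `reported` is computed before any optimizer step touches the weights, so the adaptation only benefits later chunks.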

SLOT (new)

SLOT (Sample-specific LM Optimization at Test-time) adds a per-batch residual delta in R^512 that adapts the hidden->logit mapping without modifying model weights.

Protocol:

  1. H = forward_hidden(x_batch) -- full 11-layer forward, single pass, torch.no_grad()
  2. H.detach() -- gradient graph cut from model weights
  3. 5 AdamW steps optimizing delta through compute_logits(H + delta) only -- backprop through the final linear projection only (~524K weights, ~1% of full forward cost)
  4. Score with compute_logits(H + delta.detach()) -- this is the reported BPB
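The four steps above can be sketched as follows (toy shapes; `backbone` and `lm_head` are hypothetical stand-ins for `forward_hidden` and `compute_logits`):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

dim, vocab, T = 512, 64, 32
backbone = torch.nn.Linear(vocab, dim)   # stand-in for the 11-layer forward
lm_head  = torch.nn.Linear(dim, vocab)   # hidden -> logit projection
for p in list(backbone.parameters()) + list(lm_head.parameters()):
    p.requires_grad_(False)              # model weights frozen throughout

x = F.one_hot(torch.randint(0, vocab, (1, T)), vocab).float()
y = torch.randint(0, vocab, (1, T))

# 1-2) Expensive forward runs once, detached from model weights.
with torch.no_grad():
    H = backbone(x)                      # (1, T, dim)

# 3) 5 AdamW steps on a single residual delta in R^dim (SLOT_LR=0.003).
delta = torch.zeros(1, 1, dim, requires_grad=True)
opt = torch.optim.AdamW([delta], lr=0.003)
for _ in range(5):
    opt.zero_grad()
    logits = lm_head(H + delta)          # backprop through lm_head only
    F.cross_entropy(logits.view(-1, vocab), y.view(-1)).backward()
    opt.step()

# 4) Score with frozen weights and the detached delta.
with torch.no_grad():
    scored_loss = F.cross_entropy(
        lm_head(H + delta.detach()).view(-1, vocab), y.view(-1)).item()
```

Note that only `delta` is registered with the optimizer, so the five update steps cannot touch `backbone` or `lm_head` weights.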

Legality: Model weights are frozen throughout. delta is optimized only on tokens that have already been scored in the current batch window (same guarantees as the existing TTT). Score-first ordering is preserved. One left-to-right pass.

Efficiency: The expensive forward (11 layers) runs once. The 5 optimization steps are through a single linear layer -- total SLOT overhead ~14ms per batch, well within the 600s budget.

Hyperparameters: SLOT_LR=0.003, SLOT_STEPS=5 (ablated: default lr=0.001/steps=3 gives -0.0007 bpb; tuned config gives -0.0042 bpb)

Compliance Verification

  • Rule 1 (no future tokens in p_t): pass -- delta optimized only on already-scored tokens
  • Rule 3 (score before update): pass -- score-first at every position
  • Rule 4 (one left-to-right pass): pass -- no rescoring
  • GPTQ-lite: data-free percentile clipping on weight tensors only (torch.quantile on param.data; no training data reopened post-600s)
  • N-gram cache: not used
  • Multi-epoch TTT before scoring: not present
  • Eval time: 568.5s-571.8s across all seeds
  • Artifact: 15.947MB-15.960MB across all seeds

@sahiee-dev
Author

Note on GPTQ-lite: this uses data-free percentile clipping on weight tensors only (torch.quantile on param.data, line 1548 in train_gpt.py). No training data is reopened after the 600s training cap. Flagging this proactively given the discussion in #677.
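A minimal sketch of data-free percentile clipping (an assumption of symmetric clipping at a high quantile of |w|; the actual logic in train_gpt.py may differ):

```python
import torch

torch.manual_seed(0)

def percentile_clip_(param, q=0.999):
    """Clamp a weight tensor to its q-th absolute-value quantile, in place.
    Data-free: only param.data is inspected, no activations or training data."""
    with torch.no_grad():
        thresh = torch.quantile(param.data.abs().flatten(), q)
        param.data.clamp_(-thresh, thresh)
    return thresh

w = torch.nn.Parameter(torch.randn(256, 256))
t = percentile_clip_(w)   # outliers beyond the 99.9th percentile are clamped
```

Because the threshold comes from `torch.quantile` over `param.data` alone, the step needs no calibration set and can run after the training cap without reopening data.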

@sahiee-dev
Author

@valerio-oai @0hq requesting an organizer ruling on SLOT legality before this PR is reviewed.

This submission uses SLOT (arXiv:2505.12392v2) layered on top of legal score-first TTT. A legality concern was raised in PR #1128 (see @dexhunter's comment there) that I want to address proactively.

How SLOT works in our implementation:

  1. Model weights are fully frozen during SLOT
  2. forward_hidden() runs under torch.no_grad() -> no weight gradients
  3. A small delta vector (shape [1,1,512]) is optimized for 5 steps through compute_logits() only
  4. The same batch is then scored using (H + delta) -> model weights unchanged

The two concerns raised by @dexhunter in #1128:

  • Score-before-update: delta is optimized on the batch before scoring the same batch. However, model weights are never updated; only a transient per-batch vector is adapted. Does this count as a "state update" under the competition rules?
  • Causality: delta shape [1,1,512] broadcasts across all sequence positions. Optimizing delta over the full batch means position t's delta is influenced by targets at t+1, t+2, etc.
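The causality concern can be made concrete with a toy gradient check (hypothetical shapes and a stand-in loss): a shared [1,1,512] delta accumulates gradient from every position, so its final value differs from what position 0's loss alone would produce.

```python
import torch

torch.manual_seed(0)

B, T, D = 1, 4, 512
H = torch.randn(B, T, D)
delta = torch.zeros(1, 1, D, requires_grad=True)

# Stand-in per-position loss; the real objective is cross-entropy on targets.
per_pos_loss = ((H + delta) ** 2).mean(dim=-1)     # shape (1, T)
per_pos_loss.sum().backward()
full_grad = delta.grad.clone()                      # sum over ALL positions

delta.grad = None
per_pos_loss = ((H + delta) ** 2).mean(dim=-1)
per_pos_loss[0, 0].backward()                       # position 0 only
pos0_grad = delta.grad.clone()
# full_grad != pos0_grad: the shared delta seen at position 0 was shaped
# by losses (hence targets) at positions 1, 2, 3 as well.
```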

I self-flagged this rather than waiting for a reviewer to catch it. Happy to close this PR immediately if the ruling is that SLOT violates the causal or score-before-update conditions. Alternatively, if there's a minimal fix (e.g., per-position delta, or per-token sequential optimization) that would make it legal, I can implement that.

Requesting an explicit ruling so the community has clarity, since multiple PRs are building on SLOT.
