Legal TTT (SGD, 3-epoch) + SLOT (lr=0.003, steps=5) on PR #549 base -- val_bpb: 1.11512 (3-seed mean, beats merged SOTA 1.1194) #1150
Note on GPTQ-lite: this uses data-free percentile clipping on weight tensors only.
@valerio-oai @0hq requesting an organizer ruling on SLOT legality before this PR is reviewed. This submission uses SLOT (arXiv:2505.12392v2) layered on top of legal score-first TTT. A legality concern was raised in PR #1128 (see @dexhunter's comment there) that I want to address proactively. How SLOT works in our implementation:
The two concerns raised by @dexhunter in #1128:
I self-flagged this rather than waiting for a reviewer to catch it. Happy to close this PR immediately if the ruling is that SLOT violates the causal or score-before-update conditions. Alternatively, if there's a minimal fix (e.g., a per-position delta, or per-token sequential optimization) that would make it legal, I can implement that. Requesting an explicit ruling so the community has clarity: multiple PRs are building on SLOT.
Summary
Builds on the merged SOTA (PR #549,
2026-03-23_LeakyReLU_LegalTTT_ParallelMuon, 1.1194 bpb) by adding SLOT (Sample-specific LM Optimization at Test-time) as a complementary eval-time adaptation on top of the existing legal TTT.
Results
All seeds: artifact < 16MB (pass), eval < 600s (pass), beats merged SOTA 1.1194 (pass)
Architecture
No changes to training. Identical to PR #549 base:
Eval-Time Additions
Legal TTT (unchanged from PR #549)
SLOT (new)
SLOT (Sample-specific LM Optimization at Test-time) adds a per-batch residual delta in R^512 that adapts the hidden->logit mapping without modifying model weights.
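The mechanism above can be sketched in a few lines of NumPy. This is a hedged illustration, not the PR's code: `slot_adapt` and all shapes are assumptions, with the hidden size set to a toy `d` rather than 512, and the 5 SGD steps touching only the delta while the hidden states and lm_head weight stay frozen.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def slot_adapt(h, W, targets, lr=0.003, steps=5):
    """Fit a single residual delta (shape (d,)) shared across the batch.
    h: (T, d) frozen final hidden states; W: (V, d) frozen lm_head weight;
    targets: (T,) token ids already scored in this window."""
    T, d = h.shape
    delta = np.zeros(d)
    for _ in range(steps):
        logits = (h + delta) @ W.T            # only a linear layer per step
        g = softmax(logits)                   # (T, V) predicted distribution
        g[np.arange(T), targets] -= 1.0       # dCE/dlogits = p - onehot
        delta -= lr * (g @ W).mean(axis=0)    # SGD on delta; W, h untouched
    return delta
```

Because the logits are linear in delta, the cross-entropy objective is convex in delta, so a few small SGD steps reliably reduce the loss on the tokens it is fit to.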
Protocol:
Legality: Model weights are frozen throughout. delta is optimized only on tokens that have already been scored in the current batch window (same guarantees as the existing TTT). Score-first ordering is preserved. One left-to-right pass.
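One way to make the score-before-update guarantee concrete is a chunked left-to-right pass over the batch window: each chunk is scored with a delta fitted only on chunks that were already scored. The function name `eval_window`, the chunk size, and the chunking itself are illustrative assumptions about the protocol, not code from this PR.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def eval_window(h, targets, W, chunk=16, lr=0.003, steps=5):
    """Score one batch window left-to-right in chunks. A chunk is always
    scored BEFORE delta is updated on it, so no token's score depends on
    its own (or a later) target."""
    T, d = h.shape
    delta = np.zeros(d)
    bits = []
    for s in range(0, T, chunk):
        hc, tc = h[s:s + chunk], targets[s:s + chunk]
        p = softmax((hc + delta) @ W.T)                  # score first
        bits.extend(-np.log2(p[np.arange(len(tc)), tc]))
        for _ in range(steps):                           # then adapt on the
            g = softmax((hc + delta) @ W.T)              # just-scored chunk
            g[np.arange(len(tc)), tc] -= 1.0
            delta -= lr * (g @ W).mean(axis=0)
    return float(np.mean(bits))                          # bits per token
```

A useful sanity check of the ordering: with `chunk >= T` the whole window is scored with the zero delta, so the result must equal the unadapted bits-per-token exactly.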
Efficiency: The expensive forward (11 layers) runs once. The 5 optimization steps are through a single linear layer -- total SLOT overhead ~14ms per batch, well within the 600s budget.
Hyperparameters: SLOT_LR=0.003, SLOT_STEPS=5 (ablated: default lr=0.001/steps=3 gives -0.0007 bpb; tuned config gives -0.0042 bpb)
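For concreteness, the two ablated settings as a config sketch (the dict names are hypothetical; the bpb deltas are the ones reported above):

```python
SLOT_DEFAULT = {"lr": 0.001, "steps": 3}  # -0.0007 bpb
SLOT_TUNED   = {"lr": 0.003, "steps": 5}  # -0.0042 bpb (this PR)
```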
Compliance Verification