Record: SLOT + LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1154 (3-seed mean) val_bpb = 1.1154 (3-seed mean, std 0.0002) | ~15.9 MB | 8×H100 SXM by AnubhavBharadwaaj · Pull Request #1128 · openai/parameter-golf

AnubhavBharadwaaj · 2026-03-30T09:43:20Z

Record: SLOT + LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1154 (3-seed mean)

val_bpb = 1.1154 (3-seed mean, std 0.0002) | ~15.9 MB | 8×H100 SXM

3-Seed Results (8×H100 80GB SXM)

Seed	step_avg	steps	Pre-TTT bpb	Post-TTT+SLOT bpb	TTT+SLOT time	Artifact
1337	84.2ms	7,131	1.1381	1.1153	568s	15,997,676
42	84.1ms	7,133	1.1384	1.1156	568s	15,891,784
2025	83.9ms	7,151	1.1380	1.1153	571s	15,891,988
Mean	84.1ms	7,138	1.1382	1.1154 (std 0.0002)	~569s	—

vs Previous SOTA (PR #549)

Metric	PR #549	This submission	Delta
val_bpb (3-seed mean)	1.1194	1.1154	-0.0040
val_loss (3-seed mean)	1.8916	1.8833	-0.0083 nats
Significance (p < 0.01)	—	Yes	All 3 seeds individually beat SOTA
Record bar (≥0.005 nats)	—	0.0083 nats	✅ Cleared

Key Innovation: SLOT (Sample-specific LM Optimization at Test-time)

First SLOT-based entry in Parameter Golf. SLOT optimizes a single additive δ ∈ ℝ^512 vector at the last hidden layer during TTT scoring, adapting the model's hidden-to-logit mapping per-batch.

Source: Hu et al., arXiv:2505.12392v2, "SLOT: Sample-specific Language Model Optimization at Test-time" (Westlake University, 2025)

How SLOT Works

The model's forward_logits() is split into forward_hidden() + compute_logits(). During TTT Phase 1 (scoring), SLOT optimizes δ between the two:

for each batch of windows:
    # 1. Get hidden states from TTT-adapted model
    H = model.forward_hidden(x_batch)           # [bsz, seq_len, 512]

    # 2. Optimize delta (5 AdamW steps, lr=0.003)
    delta = zeros(1, 1, 512)                    # broadcasts across batch + seq
    optimizer = AdamW([delta], lr=0.003)
    for step in range(5):
        logits = model.compute_logits(H + delta)
        loss = CE(logits[:, :-1], targets[:, 1:])
        loss.backward()                          # gradients only through lm_head
        optimizer.step()

    # 3. Score with adapted logits
    final_logits = model.compute_logits(H + delta)
    nll = CE(final_logits, targets)              # used for BPB

Why SLOT Works

SLOT and TTT address complementary bottlenecks:

TTT adapts all 27M model weights to local data distribution (chunk-level, SGD, 3 epochs)
SLOT fine-tunes the final hidden→logit mapping per-batch (5 AdamW steps on 512 params)

TTT gives SLOT better hidden states; SLOT gives TTT-adapted representations a final per-batch correction. The two stack because they operate at different granularities (chunk vs batch) and different model depths (all layers vs last layer only).

SLOT Properties

Zero artifact cost: δ is optimized from scratch per-batch during eval
Minimal overhead: +217s to eval (569s total vs 352s baseline TTT)
Score-first compliant: δ optimizes using autoregressive shift on the same tokens being scored; model weights are frozen during δ optimization
Clean toggle: SLOT_ENABLED=0 reproduces PR Record: LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1194 (3-seed mean) #549 baseline exactly

SLOT Hyperparameters

Parameter	Value	Notes
Learning rate	0.003	Tuned up from paper's 0.001 default (our model is 27M vs paper's 7B)
Steps	5	Tuned up from paper's 3 default
Optimizer	AdamW	weight_decay=1e-8, eps=1e-5 (from paper)
Delta shape	[1, 1, 512]	Broadcasts across batch and sequence
Delta init	zeros	Matches paper

Hyperparameter Ablation (seed 1337)

SLOT Config	BPB	Delta vs baseline
Disabled (baseline)	1.1195	—
lr=0.001, steps=3	1.1188	-0.0007
lr=0.003, steps=5	1.1153	-0.0042

Also Tested: CTW — Negative Result

Context Tree Weighting (Willems et al., 1995) was integrated and tested across three progressively improved implementations. All degraded BPB.

CTW Version	Change	BPB	Verdict
v1: Naive n-gram lookup	Deepest-match KT estimate, fixed w=0.1	1.1252	+0.005 worse
v2: Proper recursive	Full P_w = 0.5·P_e + 0.5·P_w_child + entropy gating	Not tested (speed)	—
v3: Vectorized entropy gate	Batch entropy, selective CTW loop	Still worse	Killed early

Root cause: The 11-layer transformer at 1.12 BPB already captures all n-gram patterns a depth-4 Markov model knows. Mixing in a weaker predictor adds noise regardless of implementation quality.

Also Tested: Stacking Hacks — Negative Results

Hack	Mechanism	BPB	Verdict
Adaptive Temperature	Optimize temp scalar per-batch via SGD	1.1325	+0.014 worse
Focal TTT	Upweight hard tokens in Phase 2 via focal loss	1.1441	+0.025 worse

Base Architecture (PR #549 by @abaybektursun)

11L, 512d, 8H/4KV, LeakyReLU(0.5)² MLP 3×
Parameter Banking + Parallel Muon (FlashAttention 3)
BigramHash(1536), XSA4, Partial RoPE(16), LN Scale, VE128
EMA(0.997) + Tight SWA(50), GPTQ-lite int6 + LZMA-6
Legal Score-First TTT (SGD, lr=0.002, 3 epochs, 32K chunks)

Run Command

cd /workspace/parameter-golf && SEED=1337 SLOT_ENABLED=1 SLOT_LR=0.003 SLOT_STEPS=5 \
CTW_WEIGHT=0 NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=1536 XSA_LAST_N=4 \
EMA_ENABLED=1 EMA_DECAY=0.997 SWA_ENABLED=1 SWA_EVERY=50 \
ROPE_DIMS=16 LN_SCALE=1 LATE_QAT=1 LATE_QAT_THRESHOLD=0.15 \
VE_ENABLED=1 VE_DIM=128 VE_LAYERS=9,10 \
TTT_ENABLED=1 TTT_LR=0.002 TTT_EPOCHS=3 TTT_CHUNK_TOKENS=32768 \
TTT_FREEZE_BLOCKS=0 TTT_MOMENTUM=0.9 TTT_BATCH_SEQS=32 TTT_GRAD_CLIP=1.0 \
MUON_WD=0.04 ADAM_WD=0.04 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3500 \
ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

Credits

SLOT integration, tuning, and analysis: Anubhav (@AnubhavBharadwaaj)
SLOT algorithm: Yang Hu et al. (arXiv:2505.12392v2, Westlake University)
CTW negative result analysis: Anubhav (@AnubhavBharadwaaj)
LeakyReLU²: PR Record: 11L EMA + Int6 + XSA + LeakyReLU² + Partial RoPE (val_bpb: 1.1309) #493 by @parinzee, PR Record: 11L XSA4 + LeakyReLU(0.5)² + Cosine TTT 50ep (val_bpb=1.0622) #518 by @sofiabod
Parallel Muon + Parameter Banking: PR Record: Parallel Muon + Parameter Banking — 81.87ms/step, val_bpb 1.1247 (3-seed mean) #399 by @abaybektursun
TTT recipe: PR Non-record: 11L Depth Recurrence + High-Yield Legal TTT (1.14458 BPB) #461 by @Christopher-Lee-McClendon (adapted: freeze=0)
Base model: PR Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15 (val_bpb=1.1233) #414 by @signalrush

@0hq or @valerio-oai
Hey @0hq, I've applied for the Development grant several times but no response yet. GitHub: AnubhavBharadwaaj. Could you help check the status?

…PB (1.1185, 3-seed mean)

@abaybektursun

…result on PR openai#549 stack First SLOT (Sample-specific LM Optimization at Test-time) entry in Parameter Golf. SLOT optimizes a delta vector at the last hidden layer inside the TTT scoring loop. SLOT results (3-seed): seed 1337: 1.1188 BPB | seed 42: 1.1185 BPB | seed 2025: 1.1183 BPB mean: 1.1185 (std 0.0003) vs baseline 1.1193 — consistent -0.0008 improvement Also documents CTW as a negative result across 3 implementation iterations: v1 (naive n-gram lookup): +0.005 worse, 46 min eval v2 (proper recursive weighting + entropy gating): not runnable in time budget v3 (vectorized entropy gate): still worse, killed early Root cause: signal redundancy — transformer already captures all n-gram patterns Base: PR openai#549 by @abaybektursun (LeakyReLU² + Legal TTT + Parallel Muon)

@abaybektursun

…4 (3-seed mean) First SLOT (Sample-specific LM Optimization at Test-time) entry in Parameter Golf. Optimizes 512-dim delta vector at last hidden layer per-batch during TTT scoring. AdamW lr=0.003, 5 steps. Splits forward_logits() into forward_hidden() + compute_logits(). 3-seed results (8xH100 SXM): seed 1337: 1.1153 BPB | seed 42: 1.1156 BPB | seed 2025: 1.1153 BPB mean: 1.1154 (std 0.0002) | val_loss mean: 1.8833 vs SOTA PR openai#549: -0.0083 nats (>0.005 required) ✅ Base: PR openai#549 by @abaybektursun SLOT paper: Hu et al., arXiv:2505.12392v2

dexhunter · 2026-03-31T07:03:45Z

Hi @AnubhavBharadwaaj -- constructive observation about SLOT legality that might be worth considering.

After reviewing the organizer's enforcement pattern on Issue #677, I noticed that SLOT may fall under the same "adapt on validation before the reported eval pass" pattern that led to 33+ PR closures (valerio-oai, 2026-03-27):

Condition 3 (Score before update): SLOT optimizes the delta using F.cross_entropy on target tokens (y_batch), then scores those same tokens with the optimized delta. The delta is the "runtime state" being updated using x_t before x_t is scored.
Condition 1 (Causality): The delta has shape [1,1,512] and broadcasts across all positions. Since it's optimized over all positions in the batch, the prediction at position t is influenced by tokens at positions t+1, t+2, ..., which violates strict prefix-only dependence.

This differs from the legal score-first TTT in PR #549, where chunk N is scored first (under inference_mode()), then the model trains on chunk N for future chunks. SLOT adapts and scores the same tokens in the same batch.

No organizer has ruled on SLOT specifically, so this may be fine -- but I wanted to flag it so the community can discuss before multiple PRs build on this technique. An organizer clarification on Issue #677 or #1017 would help everyone.

(We had a SLOT-based submission at 1.1015 that we self-closed for this reason: PR #1172.)

AnubhavBharadwaaj added 4 commits March 29, 2026 21:58

Non-record: SLOT eval-time augmentation — first SLOT entry, -0.0008 B…

142ca2a

…PB (1.1185, 3-seed mean)

Log files for each seed test

44a59d4

Updated README.md with room for improvment and compute request

0331ed1

dexhunter mentioned this pull request Mar 31, 2026

Record: SLOT + Split-LR + Full GPTQ + XSA-all — val_bpb 1.1015 (3-seed mean) #1172

Open

sahiee-dev mentioned this pull request Mar 31, 2026

Legal TTT (SGD, 3-epoch) + SLOT (lr=0.003, steps=5) on PR #549 base -- val_bpb: 1.11512 (3-seed mean, beats merged SOTA 1.1194) #1150

Open

bigbag mentioned this pull request Mar 31, 2026

Record: QK-Gain 4.0 + XSA-11 + Muon-TTT + SLOT — val_bpb 1.0914 (3-seed mean) #1176

Open

AnubhavBharadwaaj mentioned this pull request Mar 31, 2026

Non-Record: SLOT Eval-Time Augmentation on PR #549 SOTA Stack val_bpb = 1.1185 (3-seed mean, std 0.0003) | ~15.9 MB | 8×H100 SXM #1084

Open

andrewbaggio1 mentioned this pull request Apr 2, 2026

Non-record: Does SLOT violate causal dependence? (empirical test + question) #1240

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record: SLOT + LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1154 (3-seed mean) val_bpb = 1.1154 (3-seed mean, std 0.0002) | ~15.9 MB | 8×H100 SXM#1128

Record: SLOT + LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1154 (3-seed mean) val_bpb = 1.1154 (3-seed mean, std 0.0002) | ~15.9 MB | 8×H100 SXM#1128
AnubhavBharadwaaj wants to merge 5 commits intoopenai:mainfrom
AnubhavBharadwaaj:anubhav-slot-record

AnubhavBharadwaaj commented Mar 30, 2026 •

edited

Loading

Uh oh!

dexhunter commented Mar 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

AnubhavBharadwaaj commented Mar 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Record: SLOT + LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1154 (3-seed mean)

3-Seed Results (8×H100 80GB SXM)

vs Previous SOTA (PR #549)

Key Innovation: SLOT (Sample-specific LM Optimization at Test-time)

How SLOT Works

Why SLOT Works

SLOT Properties

SLOT Hyperparameters

Hyperparameter Ablation (seed 1337)

Also Tested: CTW — Negative Result

Also Tested: Stacking Hacks — Negative Results

Base Architecture (PR #549 by @abaybektursun)

Run Command

Credits

Uh oh!

dexhunter commented Mar 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

AnubhavBharadwaaj commented Mar 30, 2026 •

edited

Loading