Non-record: 11L EMA + TTT(20ep,freeze=0) + 15-run ablation study — val_bpb=1.1213 (3-seed) #398

Open
felipe-parodi wants to merge 1 commit into openai:main from felipe-parodi:submission/11L-EMA-TTT20ep-1.1213
Conversation

@felipe-parodi

Record: 11L EMA + TTT(20ep) — val_bpb: 1.1213

val_bpb = 1.1213 (sliding window stride=64, best seed 1337) | 15.53 MB artifact | 8xH100 SXM, 600s

Key Finding: EMA + Aggressive TTT with All Blocks Unfrozen

EMA(0.997) weight averaging combined with aggressive test-time training (20 epochs SGD, lr=0.008, all blocks unfrozen) outperforms Tight SWA + VE128 approaches (PR #388, 1.1231).
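The EMA(0.997) averaging above can be sketched as a per-weight exponential moving average kept alongside the training weights and swapped in at eval time. A minimal illustration only: `ema_update` and the dict-of-scalars layout are invented for this sketch, not the submission's actual API, which would apply the same rule per tensor.

```python
def ema_update(ema_params, model_params, decay=0.997):
    """One EMA step: ema <- decay * ema + (1 - decay) * current.

    With decay=0.997 the average has an effective horizon of roughly
    1/(1-0.997) ~ 333 steps, smoothing late-training noise.
    (Sketch with scalar weights; real code does this per tensor.)
    """
    for k in ema_params:
        ema_params[k] = decay * ema_params[k] + (1.0 - decay) * model_params[k]
    return ema_params
```

At evaluation, the EMA copy (not the raw last-step weights) is the model that gets quantized and scored.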

Results (3-seed, 8xH100 SXM)

Seed   Steps   Sliding BPB (s64)   Artifact
1337   7386    1.1213              15.53 MB
42     7411    1.1221              15.51 MB
2025   7386    1.1228              15.53 MB

Mean: 1.1221 | Std: 0.0008

Critical Discoveries (15-run ablation)

  1. TTT_FREEZE_BLOCKS=0 is essential — freezing early blocks during aggressive TTT creates internal inconsistency (quant gap 5x worse)
  2. Late QAT is counterproductive with aggressive TTT
  3. XSA removed — saves ~1.4ms/step, yielding ~130 more training steps
  4. PPM-C eval blending hurts strong models — classical compression adds noise when neural model is already strong
  5. Memory tokens, aggressive warmdown, gradient-guided quant — all documented negative results
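The freeze_blocks=0 finding can be illustrated with a toy version of the TTT loop: plain SGD with momentum over the eval text, where `freeze_blocks` early blocks are skipped. Scalar "blocks" and the `grad_fn` callback are invented for this sketch; only the hyperparameters (lr=0.008, momentum=0.9, 20 epochs) come from the run config.

```python
def ttt_sgd(params, grad_fn, lr=0.008, momentum=0.9, epochs=20, freeze_blocks=0):
    """Toy aggressive-TTT loop: SGD with momentum, updating every block
    when freeze_blocks=0. Freezing early blocks (freeze_blocks > 0)
    leaves them fixed while later blocks drift, which is the internal
    inconsistency blamed above for the 5x worse quant gap.
    `params` maps block index -> weight (scalars here for illustration);
    `grad_fn(params)` returns one gradient per block."""
    velocity = {b: 0.0 for b in params}
    for _ in range(epochs):
        grads = grad_fn(params)
        for b in params:
            if b < freeze_blocks:  # frozen early blocks get no update
                continue
            velocity[b] = momentum * velocity[b] + grads[b]
            params[b] -= lr * velocity[b]
    return params
```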

Run Command

pip install zstandard flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291

SEED=1337 NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=2048 XSA_LAST_N=0 \
EMA_ENABLED=1 EMA_DECAY=0.997 SWA_ENABLED=0 \
ROPE_DIMS=16 LN_SCALE=1 LATE_QAT=0 \
TTT_ENABLED=1 TTT_LR=0.008 TTT_EPOCHS=20 TTT_MOMENTUM=0.9 TTT_FREEZE_BLOCKS=0 \
MUON_WD=0.04 ADAM_WD=0.04 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3000 \
ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

See the full README in the submission folder for detailed architecture, training config, TTT analysis, and the complete 15-run ablation table.

samuelczhao added a commit to samuelczhao/parameter-golf that referenced this pull request Mar 22, 2026
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 22, 2026
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 22, 2026
robinojw pushed a commit to robinojw/parameter-golf that referenced this pull request Mar 22, 2026
PR openai#398: 11L EMA + TTT(20ep, freeze=0), no XSA, no Late QAT
- Best seed 1.1213 BPB, 3-seed mean 1.1221
- 7386 steps at ~81ms/step
- Has: FA3, NTK RoPE, MTP, TTT, (B,S,H,D) layout
- Missing: memory tokens, magnitude pruning, late-K passthrough
robinojw pushed a commit to robinojw/parameter-golf that referenced this pull request Mar 22, 2026
…#398 base

Built on PR openai#398 (1.1213 BPB). Three targeted improvements:

1. Cautious Muon: mask Muon updates that disagree with gradient
   direction (~1.47x convergence speedup, 2 lines, zero risk)

2. Magnitude pruning (5% default): zero smallest weights before
   quantization, improves zstd compression ratio by 5-15%

3. allow_in_graph + cache_size_limit=32: safer torch.compile
   with FA3 custom ops and 11-block guard specialization

Respects PR openai#398 negative results: no memory tokens, no Late QAT.
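The "Cautious Muon" idea in point 1 above — mask update components whose sign disagrees with the raw gradient — can be sketched element-wise. Lists of floats stand in for tensors, and `cautious_mask` is an invented name for illustration, not the commit's actual code.

```python
def cautious_mask(update, grad):
    """Zero out optimizer-update components whose sign disagrees with
    the raw gradient, so no element of the step moves against the
    current descent direction. (Element-wise sketch; the real change
    would mask whole tensors in the Muon update.)"""
    return [u if u * g > 0 else 0.0 for u, g in zip(update, grad)]
```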
robinojw pushed a commit to robinojw/parameter-golf that referenced this pull request Mar 22, 2026
robinojw pushed a commit to robinojw/parameter-golf that referenced this pull request Mar 22, 2026
sjp611 added a commit to sjp611/parameter-golf that referenced this pull request Mar 22, 2026
Replace SGD with AdamW for test-time training. 3-line diff from PR openai#398.
Mean val_bpb 1.1027 (3-seed), best 1.0992. Beats prior SOTA 1.1213 by 0.019.
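The SGD-to-AdamW swap for TTT can be illustrated with a single scalar AdamW step. This is a generic sketch of the update rule, not the commit's actual 3-line diff; the hyperparameter values shown are illustrative.

```python
def adamw_step(w, g, state, lr=5e-4, betas=(0.9, 0.999), eps=1e-8, wd=0.0):
    """One AdamW step on a single scalar weight: bias-corrected first
    and second moments, decoupled weight decay. Unlike momentum SGD,
    the per-weight second moment normalizes step sizes, which is the
    likely reason it tolerates aggressive TTT better."""
    m, v, t = state
    t += 1
    m = betas[0] * m + (1 - betas[0]) * g
    v = betas[1] * v + (1 - betas[1]) * g * g
    m_hat = m / (1 - betas[0] ** t)   # bias correction
    v_hat = v / (1 - betas[1] ** t)
    w = w - lr * (m_hat / (v_hat ** 0.5 + eps) + wd * w)
    return w, (m, v, t)
```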
ThomAub pushed a commit to ThomAub/parameter-golf that referenced this pull request Mar 22, 2026
Many TTT submissions (openai#136, openai#152, openai#254, openai#264, openai#338, openai#398, openai#417, openai#421, openai#442)
flagged as potentially invalid for adapting on eval tokens BEFORE scoring them.
Added correct score-then-adapt protocol with implementation guide.

https://claude.ai/code/session_01M5XTtyz2Zdq5BDeh9qNn9y
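The score-then-adapt protocol referenced above might look like this in outline: each eval chunk is scored with the weights as they stand, and only afterwards used to adapt the model, so no chunk's own tokens influence its score. All names and callables here are hypothetical placeholders, not the repo's interface.

```python
def score_then_adapt(model, chunks, loss_fn, adapt_fn):
    """Legal TTT evaluation sketch: score each chunk FIRST with the
    current weights, THEN adapt on it. Adapting before scoring would
    leak the chunk's own content into its score.
    `loss_fn(model, chunk)` returns the chunk's loss;
    `adapt_fn(model, chunk)` returns the updated model."""
    total = 0.0
    for chunk in chunks:
        total += loss_fn(model, chunk)   # score with weights so far
        model = adapt_fn(model, chunk)   # then adapt on the scored chunk
    return total / len(chunks), model
```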
JoeProAI pushed a commit to JoeProAI/parameter-golf that referenced this pull request Mar 22, 2026
Architecture discovered via GEPA (Gemini-driven evolutionary search).
SwiGLU FFN, Star-ReLU, U-Net skip gates, BigramHash 8192, XSA4.
AdamW TTT (lr=0.0005, 10ep) from @sjp611 (openai#442).
EMA, RoPE, LN Scale, QAT from @felipe-parodi (openai#398) and @fbedev (openai#410).

3-seed results: 1.06733 / 1.06833 / 1.06580
Mean: 1.06715, Std: 0.00104

Built by @joepro with AI agents via OpenClaw.
Compute provided by Modal.
JoeProAI added a commit to JoeProAI/parameter-golf that referenced this pull request Mar 22, 2026
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 22, 2026
…ttern)

Root cause: per-sequence indexing from permuted indices was ~100x slower
than contiguous val_tokens slicing. Each GPU now takes a contiguous
shard and iterates sequentially, matching openai#398's working implementation.
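The contiguous-shard fix described above can be sketched as simple slice arithmetic: each rank takes one contiguous run of `val_tokens` and walks it sequentially, instead of gathering sequences through permuted indices. Function name and signature are invented for illustration.

```python
def shard_contiguous(val_tokens, rank, world_size, seq_len):
    """Give each GPU one contiguous shard of val_tokens, then split it
    into sequential eval sequences. Contiguous slicing is cache- and
    copy-friendly, avoiding the ~100x slowdown of per-sequence gathers
    from a permuted index list."""
    n_seqs = len(val_tokens) // seq_len
    per_rank = n_seqs // world_size
    start = rank * per_rank * seq_len
    end = start + per_rank * seq_len
    shard = val_tokens[start:end]
    return [shard[i:i + seq_len] for i in range(0, len(shard), seq_len)]
```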
mohosy pushed a commit to mohosy/parameter-golf that referenced this pull request Mar 23, 2026
Major rewrite based on latest meta (PRs openai#398, openai#442, openai#462):
- SwiGLU FFN with Star-ReLU (hidden=1792)
- U-Net skip connections with learned gating
- EMA (decay=0.9985) replacing SWA
- AdamW TTT (legal score-first protocol)
- Partial RoPE (16 dims)
- LN Scale (1/sqrt(layer_idx+1))
- BigramHash(8192) + SmearGate
- GPTQ-lite quantization
- DDP compile fix for multi-GPU

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 23, 2026
… 3 seeds)

AdamW TTT with cosine lr decay over 30 epochs and per-layer lr groups
(3x for MLP output projections, 0.5x for input projections). 34 TTT
configurations tested. FINDINGS.md documents 31 experiments including
negative results on codebook quantization, symmetry-transport, layer
dropping, focal loss, and KL divergence TTT.

Builds on PRs openai#162, openai#180, openai#77, openai#398, openai#442, openai#417, openai#315.
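The per-layer lr grouping described above (3x for MLP output projections, 0.5x for input projections) might be built along these lines. The multipliers come from the commit message; the function, the naming scheme, and the dict-based "param group" shape are invented for this sketch.

```python
def ttt_param_groups(layer_names, base_lr=1e-3):
    """Assign a per-layer TTT learning rate: MLP output projections get
    3x the base lr, MLP input projections 0.5x, everything else 1x.
    (The suffix matching on names is a guess at a plausible scheme.)"""
    groups = []
    for name in layer_names:
        if name.endswith("mlp.out_proj"):
            lr = 3.0 * base_lr
        elif name.endswith("mlp.in_proj"):
            lr = 0.5 * base_lr
        else:
            lr = base_lr
        groups.append({"name": name, "lr": lr})
    return groups
```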
@felipe-parodi changed the title from "Record: 11L EMA + TTT(20ep,freeze=0) — val_bpb=1.1213 (3-seed mean 1.1221)" to "Non-record: 11L EMA + TTT(20ep,freeze=0) + 15-run ablation study — val_bpb=1.1213 (3-seed)" on Mar 23, 2026
@felipe-parodi (Author) commented Mar 23, 2026

Converting to non-record per TTT ruling (Issue #402). The README documents a 15-run ablation (memory tokens, causal TTT, PPM-C blending, grad-guided quant, aggressive warmdown — all negative results at the frontier) and the freeze_blocks=0 finding for aggressive TTT. Working on a non-TTT submission.

lukacf added a commit to lukacf/parameter-golf-submission that referenced this pull request Mar 23, 2026
3-seed mean: 0.9789 BPB (sliding window stride=64)
Best seed: 0.9779 (seed 7)
Std: 0.0015

Key innovation: Autonomous ML research methodology.
AI coding agent discovered cosine LR scaling for TTT in a single
2-hour session — 7 experiments from hypothesis to record.

Technical: CosineAnnealingLR over 100 TTT epochs (3-line change).
Architecture: PR openai#398/openai#442 base (11L, int6+zstd, 15.51MB).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
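The cosine LR scaling over 100 TTT epochs described above amounts to a half-cosine decay of the TTT learning rate. A stand-alone sketch of that schedule: `base_lr=0.008` is borrowed from this PR's TTT_LR for illustration; the record run's actual values may differ.

```python
import math

def cosine_lr(epoch, total_epochs=100, base_lr=0.008, min_lr=0.0):
    """Cosine-annealed learning rate: decays from base_lr to min_lr
    over total_epochs following half a cosine, so early TTT epochs
    take large steps and late epochs fine-tune gently."""
    t = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```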
lukacf added a commit to lukacf/parameter-golf-submission that referenced this pull request Mar 23, 2026