Record: 11L + EMA + Tight SWA + QAT0.15 + VE128 + Partial RoPE + LN Scale (val_bpb: 1.1243) #401
Closed
newjordan wants to merge 134 commits into openai:main from
Conversation
… gravity needs more steps
11L Int6 MLP3x + SmearGate + BigramHash + OrthoInit + TTT SGD 3ep
Exact reproduction of @timowhite88's FarnsworthEngine recipe. No modifications; run as-is to validate the baseline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
#1 untried combination from competition commentary: TTT (from #254) + XSA (from #265) = estimated 1.117-1.121 BPB. XSA_LAST_N=3 excludes self-attention in the final 3 layers. Zero extra params; frees attention capacity for cross-token focus.
8 Q heads with 4 KV heads need repeat_interleave before the matmul.
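The fix above can be sketched in NumPy (shapes are illustrative; the real code uses torch.repeat_interleave on the K/V tensors before the attention matmul):

```python
import numpy as np

n_q_heads, n_kv_heads, seq, head_dim = 8, 4, 32, 16
rng = np.random.default_rng(0)
q = rng.standard_normal((n_q_heads, seq, head_dim))
k = rng.standard_normal((n_kv_heads, seq, head_dim))

# Each KV head serves n_q_heads // n_kv_heads query heads, so repeat each
# KV head contiguously (the NumPy analogue of torch.repeat_interleave).
k_expanded = np.repeat(k, n_q_heads // n_kv_heads, axis=0)

# Now the per-head score matmul lines up: (8, 32, 16) @ (8, 16, 32).
scores = q @ k_expanded.transpose(0, 2, 1) / np.sqrt(head_dim)
```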
exp_a: Multi-Token Prediction (MTP_NUM_HEADS=2, excluded from export)
exp_b: SwiGLU MLP replacing ReLU² (hidden=1024, same param count)
exp_c: vocab-1536 tokenizer for a better bytes-per-token ratio
All based on the PR #254 SOTA clone (1.1303 BPB). Priority: exp_c first.
TTT v2 (cosine LR decay, discriminative per-layer LR, low momentum 0.3, WD), seq-length curriculum (256→2048), batch warmup (262K→786K), D2Z LR schedule, XSA last 3, temperature scaling, optional Mousse optimizer. Two run scripts: full stack (run_v2.sh) and conservative TTT-only (run_v2_ttt_only.sh).
flash_attn_interface (FA3 Hopper) → flash_attn (FA2) → torch SDPA. The script never crashes on missing flash-attn. Run scripts attempt a pip install on startup if FA3 is not found. Applied to both sota254 and sota_v2.
…+ untested v2
Restores all four files to their state at 83efa9c. The FA3→FA2→SDPA fallback was added in response to an environment question and should not have touched application code.
torch.compile can promote tensors to fp32, which hits missing FA3 kernels (disabled at build time). An explicit bf16 cast prevents silent NaN output.
A (MTP): 1.1619 BPB roundtrip, worse than baseline.
B (SwiGLU): 1.1348 BPB sliding, close but +0.0045 vs baseline.
Both artifacts over 16 MB due to missing zstandard (zlib fallback).
…combine
The self-exclusion mask plus the causal mask leaves position 0 with all -inf, producing NaN from softmax. Fix: don't self-exclude position 0, since it has no other causal targets to attend to.
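A NumPy sketch of the failure and the fix, assuming additive-mask attention (the mask construction here is illustrative, not the repo's actual code):

```python
import numpy as np

T = 5
NEG = float("-inf")

# Causal mask: position i may attend only to positions j <= i.
causal = np.triu(np.full((T, T), NEG), k=1)

# Self-exclusion (XSA): additionally mask the diagonal.
self_excl = np.where(np.eye(T, dtype=bool), NEG, 0.0)

naive = causal + self_excl
# Row 0 is now entirely -inf: softmax over it yields 0/0 -> NaN.
assert np.isinf(naive[0]).all()

# Fix: keep the diagonal entry for position 0, which has no earlier
# tokens to attend to.
fixed = naive.copy()
fixed[0, 0] = 0.0
probs = np.exp(fixed - fixed.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
```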
XSA_LAST_N=3 was costing ~25% step time due to the manual matmul path. Set to 0 to isolate TTT v2 + temp scaling gains at full speed.
XSA manual attention killed step speed; only 4771/9000 steps completed.
…seline
exp_a MTP: 1.1619, exp_b SwiGLU: 1.1570, exp_c: missing tokenizer data. TTT v1 hurt in both exp_a and exp_b (same pattern as TTT v2).
Same model/artifact as the SOTA254 baseline, zero risk. More TTT adaptation (3→8 epochs) and a finer sliding window (64→32 stride).
TTT_SAM=1 enables SAM during test-time training. Two forward+backward passes per step: the first computes the gradient and perturbs the weights by rho in the gradient direction; the second recomputes the gradient at the perturbed point. The perturbed gradient then updates the original weights, seeking flatter minima that generalize better. Motivated by TTT consistently overfitting: loss goes down but eval gets worse across all runs. SAM directly targets this failure mode.
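The two-pass update described above can be sketched on a toy 1-D quadratic loss (a generic SAM sketch, not the actual TTT_SAM code; lr and rho values are illustrative):

```python
import numpy as np

def grad(w):
    # Toy loss L(w) = (w - 3)^2; its gradient is 2 * (w - 3).
    return 2.0 * (w - 3.0)

def sam_step(w, lr=0.1, rho=0.05):
    g = grad(w)                          # pass 1: gradient at current weights
    eps = rho * g / (np.abs(g) + 1e-12)  # normalized ascent perturbation
    g_perturbed = grad(w + eps)          # pass 2: gradient at perturbed point
    return w - lr * g_perturbed          # update the ORIGINAL weights

w = 0.0
for _ in range(200):
    w = sam_step(w)
# w settles near the minimum at 3 (oscillating within ~rho of it).
```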
Exact settings from the README. If this doesn't reproduce, the FA3 build is the variable, not the code.
Same training as the 1.1303 baseline; the only change is TTT_SAM=1. SAM seeks flatter minima during test-time training to fix the TTT overfitting pattern (loss down, eval up).
TTT 8 epochs + stride 32. Stride made no difference; all gain came from the extra TTT adaptation. Same model/artifact, eval-only change.
Both seeds beat the baseline. TTT 8 epochs is a free win.
Seed 7 compresses worse than 1337/42. BPB improved, but the artifact exceeds the 16 MB cap. Need a passing 3rd seed for submission.
ROPE_DIMS=16: apply rotary to 25% of head dims; the rest are position-free.
LN_SCALE=1: scale RMSNorm output by 1/sqrt(layer+1).
Both env-var gated, default off; existing runs unaffected.
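A minimal NumPy sketch of both knobs, assuming a 64-dim head with the first 16 dims rotated (the helper names are hypothetical, not the repo's functions):

```python
import numpy as np

def partial_rope(x, rope_dims=16, base=10000.0):
    # x: (seq, head_dim). Rotate only the first rope_dims channels;
    # the remaining channels carry no positional signal.
    seq, head_dim = x.shape
    rot, rest = x[:, :rope_dims], x[:, rope_dims:]
    half = rope_dims // 2
    freqs = 1.0 / base ** (np.arange(half) / half)
    angles = np.outer(np.arange(seq), freqs)       # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = rot[:, :half], rot[:, half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, rest], axis=-1)

def scaled_rmsnorm(x, layer_idx, eps=1e-6):
    # RMSNorm output scaled by 1/sqrt(layer + 1), the LN_SCALE=1 behaviour.
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms / np.sqrt(layer_idx + 1)

x = np.random.randn(32, 64)   # hypothetical seq=32, head_dim=64
y = partial_rope(x)           # only 16 of 64 dims receive rotary
```

Rotation is orthogonal per frequency pair, so the rotated channels keep their norms; the untouched channels pass through unchanged.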
Star-ReLU(x)² replaces SiLU in SwiGLU (the GEPA non-TTT SOTA uses this). Value Residual Learning: a per-block learned lambda mixes the current hidden state with the first block's output for attention values (-0.015 BPB in PR #413). MLP trimmed 4.5x→4.0x to fit the 16 MB budget (~24.9M params, ~13.8 MB est).
SwiGLU fork (PR #462 base) + GPTQ + OptRot + AdamW TTT = 1.0763 BPB, but the artifact is 19.6 MB (over the 16 MB limit). The OptRot Hadamard rotation hurts zstd compression. Next step: solve the size problem. The v7 GPTQ stack was submitted as PR #508: 3-seed mean 1.1215 BPB.
Matches the 1.0763 run's settings via fractal compression:
- Full MHA 8/8 (was GQA 8/4)
- seq_len=1024 (2x more steps: ~8000 vs ~4300)
- batch=524288 (even more steps)
- MLP 3.5x SwiGLU Star-ReLU (fits 16MB with full MHA)
- EMA decay 0.9985 (GEPA's setting)
- warmdown 6000 (70% of expected steps)
- bigram 8192, XSA last 3, VE on blocks 4,5
- 6 unique x 2 loops = 12 effective depth
- ~24.6M params, est ~14.5MB artifact
The 1.0763 model compressed into the 16MB box via fractal weight sharing.
Three configs to map the compression cost curve on H100:
- v4 (6x2): 12 eff depth, 24.6M, ~13.3MB
- v4a (5x2): 10 eff depth, 20.8M, ~11.3MB
- v4b (4x2): 8 eff depth, 17.0M, ~9.2MB
Spark data says 5x2 is optimal (beats flat by 0.011 BPB). H100 results will give the real calibration number.
run_v7_short_ttt.sh: self-contained test script for Option A. SGD lr=0.002, 3 epochs, freeze 2 blocks, no EMA smoothing, stop training at chunk 50. Captures the chunk-51 peak (1.1106 observed) without the EMA dilution that killed TTT gains in PR #508.
train_gpt_v7.py: add a TTT_WARMUP_CHUNKS env var for optional LR warmup during TTT (default 0 = no change to existing behavior).
PR #503 shows XSA on all 11 layers helps (1.1195 vs our 1.1215 with XSA on 4). Combined with short no-EMA TTT to target 1st place. Just an env var change (XSA_LAST_N=11), no code modifications.
Single-seed result on 8xH100 SXM, 600s training + 192s eval:
Seed 1337: sliding_window=1.12070, legal_ttt=1.12075
Artifact: 15.60 MB (int6+zstd-22)
Pre-TTT sliding window is the effective score; short TTT (SGD, 50 chunks, no EMA) was net neutral (+0.00005).
Surgical weight sharing: SHARE_START=4, SHARE_LOOPS=3 replaces 3 middle layers with 1 shared block looped 3x. NUM_LAYERS=9 gives 9 stored blocks and 11 effective depth. Saves 2 blocks (~4.4 MB): 19.6 MB → ~15.2 MB, which fits the 16 MB budget. Orthogonal loop positions on the shared block. Rest of the architecture unchanged: Star-ReLU SwiGLU, full MHA 8/8, GPTQ, TTT, EMA. The Frugendorff as a compression tool on an existing SOTA model.
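The stored-vs-effective layer arithmetic can be sketched as a simple schedule (a hypothetical helper, not the actual Frugendorff code):

```python
# 9 stored blocks; the block at SHARE_START is looped SHARE_LOOPS times,
# giving 11 effective layers from 9 sets of weights.
SHARE_START, SHARE_LOOPS, NUM_LAYERS = 4, 3, 9

def layer_schedule(num_layers, share_start, share_loops):
    """Return the stored-block index executed at each effective layer."""
    schedule = list(range(share_start))                   # blocks 0..3 once
    schedule += [share_start] * share_loops               # block 4 looped 3x
    schedule += list(range(share_start + 1, num_layers))  # blocks 5..8 once
    return schedule

sched = layer_schedule(NUM_LAYERS, SHARE_START, SHARE_LOOPS)
```

Only the unique blocks are stored and quantized, which is where the ~4.4 MB saving comes from.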
ITERATIONS=7500 matches the actual wallclock steps so the cosine warmdown completes (LR→0) instead of stopping at ~45% of peak. Combined with XSA on all 11 layers.
The XSA-11 model was 24 KB over the 16 MB limit. Trimming the bigram hash table saves ~32K raw params to fit within budget.
On top of the SwiGLU compression shim:
- LeakyReLU(0.5)² replaces ReLU² (preserves negative gradients)
- VRL per-block lambda mixing in the first block's output (-0.015 BPB in PR #413)
- decoder_lr_mult=2.0 already present from base
Test A (clean): train_gpt_swiglu_frugendorff.py = pure compression cost
Test B (stacked): train_gpt_swiglu_frugendorff_stacked.py = compression + extras
4 tests in 2 batches for 2xGPU research:
A: Bigram 1536 + XSA-11 (size fit)
B: Bigram 1024 + XSA-11 (aggressive size)
C: GPTQ percdamp=0.05 (conservative error compensation)
D: GPTQ block_size=64 (less error accumulation)
Wires GPTQ_BLOCK_SIZE and GPTQ_PERCDAMP as env vars.
Removed the ttt_adapt function, all TTT hyperparameters, and the TTT call sites from both the clean and stacked versions. TTT trains on validation tokens before scoring, which is illegal per issue #402. All remaining features are pure training/architecture/quantization techniques: Star-ReLU, SwiGLU, GPTQ, EMA, U-Net, BigramHash, Frugendorff compression, VRL, LeakyReLU, decoder_lr_mult.
Systematic search for the optimal weight-sharing config:
Batch 1: size frontier (loop4, earlier sharing, fewer layers)
Batch 2: quality frontier (depth vs sharing tradeoffs)
Batch 3: compression levers (bigram, MLP tuning)
Baseline: 11L/SHARE4/LOOPS3 = 1.0900 BPB, 16.68 MB (over by 680 KB)
Target: fit ≤16 MB while keeping BPB near 1.09
Illegal TTT trains on validation tokens before scoring them, violating issue #402. Disabled by defaulting TTT_ENABLED to "0" in:
- train_gpt_swiglu.py
- train_gpt_frugendorff_v3.py
- train_gpt_frugendorff_v4.py, v4a, v4b
- train_gpt_v7.py, v7_short_ttt.py
Removed eval_ttt.py (standalone illegal TTT eval).
Legal techniques preserved:
- TTT burst (training-data replay) in v1/v4/v5/v6/squared
- Inner-TTT in the fractal h100 scripts (our own implementation)
- All training, EMA, GPTQ, sliding-window eval
The ttt_adapt() function trained on ALL validation data for N epochs BEFORE scoring, a direct violation of issue #402 (score-first rule). Removed:
- ttt_adapt() function (bulk val-data training)
- TTT hyperparameters (ttt_lr, ttt_epochs, etc.)
- TTT invocation in main()
- ttt_enabled forced to False
The legal alternative is eval_val_sliding_ttt() in train_gpt_v7.py, which scores each chunk before training on it.
Audit status:
- train_gpt_swiglu.py: FIXED (this commit)
- train_gpt_swiglu_frugendorff.py: CLEAN (no TTT)
- train_gpt_swiglu_frugendorff_stacked.py: CLEAN (no TTT)
- train_gpt_v7.py: LEGAL (score-first sliding window)
- Old exp_*/sota*/pr3* dirs: contain legacy illegal TTT but are historical experiments, not submission scripts
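A hedged sketch of the score-first rule: each validation chunk is scored with the current weights before the model may train on it. Names below are illustrative, not the actual eval_val_sliding_ttt implementation.

```python
def score_first_eval(model, chunks, score_fn, train_fn):
    """Score each chunk with pre-update weights, then adapt on it."""
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:
        loss, n_tokens = score_fn(model, chunk)  # score FIRST (legal)
        total_loss += loss * n_tokens
        total_tokens += n_tokens
        train_fn(model, chunk)  # adapt only after the chunk is scored
    return total_loss / total_tokens

# Toy demonstration: "training" increments w; each chunk's score shows
# the weights as they were before that chunk's update.
model = {"w": 0.0}
scores_seen = []
def toy_score(m, chunk):
    scores_seen.append(m["w"])
    return m["w"], 1
def toy_train(m, chunk):
    m["w"] += 1.0
avg = score_first_eval(model, [None, None, None], toy_score, toy_train)
```

The illegal ttt_adapt() pattern is the opposite ordering: run train_fn over all chunks first, then score, which leaks validation tokens into the weights before evaluation.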
Core question: how many Frugendorff loops maximize GPTQ quality per compressed byte? Each loop saves ~2.9M params (~1.7 MB compressed) but reuses the same weights. Tests: loops 3/4/5, share position 3 vs 4, bigram + MLP levers. All on train_gpt_swiglu_frugendorff.py (clean, no TTT).
Upgrades train_gpt_swiglu.py with every proven technique for max quality:
- seq_len 1024→2048, batch 524K→786K (PR #505: -0.009 BPB)
- LeakyReLU(0.5)² replaces ReLU² (preserves negative gradient flow)
- VRL: sigmoid-gated first-block mixing into the attention input
- Legal score-first TTT ported from v7 (disabled by default)
- int8 GPTQ for attn.proj (lower quant tax on sensitive layers)
- grad_clip 0→0.3, EMA 0.9985→0.997, warmdown 6000→3500
All illegal TTT remains purged; score-first only.
Block 0's vrl_lambda never receives gradient (v_first is None for the first block), so DDP requires find_unused_parameters=True to handle it.
Re-quantizes the existing final_model.pt with 8 different GPTQ configs (percdamp 0.002-0.05, block_size 64-256). Zero training cost. Tests whether different GPTQ settings compress better on Frugendorff weights.
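For context on what such sweeps vary, here is a plain per-row round-to-nearest int6 sketch. GPTQ additionally compensates quantization error column-by-column using a Hessian estimate (percdamp and block_size tune that solver), which is not shown here.

```python
import numpy as np

def quantize_int6_rtn(w):
    """Per-row symmetric int6 round-to-nearest. A plain RTN sketch;
    GPTQ would also adjust remaining columns after quantizing each one."""
    qmax = 31  # int6 two's-complement range is [-32, 31]
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(8, 64).astype(np.float32)
q, scale = quantize_int6_rtn(w)
w_hat = dequantize(q, scale)
max_err = np.abs(w - w_hat).max()  # bounded by half a quantization step
```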
Our best legal SOTA. Script + README + reproduction instructions. Three copies, because we are never losing this again.
The original train_gpt_swiglu.py restored to its pre-modification state. All F1 changes (VRL, LeakyReLU, seq2048, legal TTT, int8) live in train_gpt_swiglu_f1.py. Never overwrite a working baseline.
PR #505 base + VRL + LeakyReLU(0.5)² + int8 attn.proj + seq2048. 4521 steps @ 132.7 ms; post-GPTQ sliding 1.1208. Beats the current SOTA (1.1215) on quality alone, but over the 16 MB budget and awaiting Frugendorff compression calibration. No TTT.
This was referenced Mar 25, 2026
Record: 11L + EMA(0.997) + Tight SWA + Late QAT(0.15) + VE128 + Partial RoPE + LN Scale (val_bpb: 1.1243)
Key Innovation: EMA + Tight SWA Stacking + Earlier Late QAT
Three improvements on the PR #374 architecture:
Architecture
Training
Quantization
Results (3 seeds, 8xH100 SXM)
Best: 1.1243 | Mean: 1.1248 | Std: 0.0006
vs PR #374 (previous non-TTT record)
Run