Skip to content

Record: 11L + EMA + Tight SWA + QAT0.15 + VE128 + Partial RoPE + LN Scale (val_bpb: 1.1243)#401

Closed
newjordan wants to merge 134 commits intoopenai:mainfrom
newjordan:experiments/pr374-edge
Closed

Record: 11L + EMA + Tight SWA + QAT0.15 + VE128 + Partial RoPE + LN Scale (val_bpb: 1.1243)#401
newjordan wants to merge 134 commits intoopenai:mainfrom
newjordan:experiments/pr374-edge

Conversation

@newjordan
Copy link
Copy Markdown

@newjordan newjordan commented Mar 22, 2026

Record: 11L + EMA(0.997) + Tight SWA + Late QAT(0.15) + VE128 + Partial RoPE + LN Scale (val_bpb: 1.1243)

Key Innovation: EMA + Tight SWA Stacking + Earlier Late QAT

Three improvements on the PR #374 architecture:

  1. EMA (decay=0.997) stacked with Tight SWA — SWA collects from EMA-averaged weights, giving two orthogonal averaging signals
  2. Earlier Late QAT (scale<0.15 vs 0.1) — 50% more steps under int6 fake-quantization, shrinks quant gap from 0.008 to 0.007
  3. Longer warmdown (3500 vs 3000 iters) — extends convergence tail

Architecture

  • 11 transformer layers, 512-dim, 8 heads (4 KV heads, GQA)
  • 3x MLP expansion with relu-squared activation
  • Partial RoPE (16/64 dims) + NTK-aware scaling
  • LN Scale Factor 1/sqrt(layer_idx+1)
  • U-Net skip connections (5 encoder, 6 decoder)
  • SmearGate + BigramHash (2048 buckets, dim=128)
  • Shared Value Embedding (dim=128, layers 9,10)
  • FlashAttention 3 (Hopper)
  • Logit softcap 30.0, tied embeddings

Training

  • Muon optimizer (matrices): lr=0.025, momentum=0.99 (warmup 0.92->0.99 over 1500 steps), WD=0.04
  • AdamW (embeddings): lr=0.035, (scalars): lr=0.025, WD=0.04
  • Gradient clip: 0.3
  • Batch: 786,432 tokens/step, seq_len=2048
  • Warmdown: 3500 iters (wallclock-based)
  • EMA: decay=0.997, float32 accumulation, SWA collects from EMA weights
  • Tight SWA: every 50 steps when scale<0.2 (14 checkpoints from EMA)
  • Late QAT: STE int6 fake-quantization when LR scale<0.15

Quantization

  • Int6 per-row for MLP + attention weights
  • Int8 per-row for embeddings
  • Control tensors in fp32
  • zstd level 22 compression

Results (3 seeds, 8xH100 SXM)

Seed Steps Sliding BPB (s64) Post-avg BPB Quant gap Artifact
1337 7005 1.1243 1.1412 0.0069 15.88 MB
42 7001 1.1247 1.1417 0.0067 16.06 MB
7 7007 1.1255 1.1424 0.0068 15.68 MB

Best: 1.1243 | Mean: 1.1248 | Std: 0.0006

vs PR #374 (previous non-TTT record)

Metric PR374 Ours Delta
Sliding BPB 1.1246 1.1243 -0.0003
Quant gap 0.0080 0.0068 -0.0012
Artifact 15.71 MB 15.88 MB +0.17 MB

Run

SEED=1337 NUM_LAYERS=11 MLP_MULT=3.0 XSA_LAST_N=4 ROPE_DIMS=16 LN_SCALE=1 \
SWA_ENABLED=1 SWA_EVERY=50 LATE_QAT_THRESHOLD=0.15 WARMDOWN_ITERS=3500 \
VE_ENABLED=1 VE_DIM=128 VE_LAYERS=9,10 \
EMA_ENABLED=1 EMA_DECAY=0.997 \
BIGRAM_VOCAB_SIZE=2048 BIGRAM_DIM=128 ADAM_WD=0.04 MUON_WD=0.04 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
torchrun --nproc_per_node=8 train_gpt.py

Octavian and others added 30 commits March 18, 2026 18:06
11L Int6 MLP3x + SmearGate + BigramHash + OrthoInit + TTT SGD 3ep
Exact reproduction of @timowhite88's FarnsworthEngine recipe.
No modifications — run as-is to validate baseline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
#1 untried combination from competition commentary:
TTT (from #254) + XSA (from #265) = estimated 1.117-1.121 BPB
XSA_LAST_N=3 excludes self-attention in final 3 layers.
Zero extra params, frees attention capacity for cross-token focus.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8 Q heads with 4 KV heads needs repeat_interleave before matmul.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
exp_a: Multi-Token Prediction (MTP_NUM_HEADS=2, excluded from export)
exp_b: SwiGLU MLP replacing ReLU² (hidden=1024, same param count)
exp_c: Vocab 1536 tokenizer for better bytes-per-token ratio

All based on PR #254 SOTA clone (1.1303 BPB). Priority: exp_c first.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TTT v2 (cosine LR decay, discriminative per-layer LR, low momentum 0.3, WD),
seq-length curriculum (256→2048), batch warmup (262K→786K), D2Z LR schedule,
XSA last 3, temperature scaling, optional Mousse optimizer.

Two run scripts: full stack (run_v2.sh) and conservative TTT-only (run_v2_ttt_only.sh).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
flash_attn_interface (FA3 Hopper) → flash_attn (FA2) → torch SDPA.
Script never crashes on missing flash-attn. Run scripts attempt
pip install on startup if FA3 not found.

Applied to both sota254 and sota_v2.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…+ untested v2

Restores all four files to their state at 83efa9c. The FA3→FA2→SDPA
fallback was added in response to an environment question and should
not have touched application code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
torch.compile can promote tensors to fp32 which hits missing FA3 kernels
(disabled at build time). Explicit bf16 cast prevents silent NaN output.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
A (MTP): 1.1619 BPB roundtrip — worse than baseline
B (SwiGLU): 1.1348 BPB sliding — close but +0.0045 vs baseline
Both artifacts over 16MB due to missing zstandard (zlib fallback)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…combine

The self-exclusion mask + causal mask leaves position 0 with all -inf,
producing NaN from softmax. Fix: don't self-exclude position 0 since
it has no other causal targets to attend to.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
XSA_LAST_N=3 was costing ~25% step time due to manual matmul path.
Set to 0 to isolate TTT v2 + temp scaling gains at full speed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
XSA manual attention killed step speed, only 4771/9000 steps completed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…seline

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
exp_a MTP: 1.1619, exp_b SwiGLU: 1.1570, exp_c: missing tokenizer data.
TTT v1 hurt in both exp_a and exp_b (same pattern as TTT v2).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Same model/artifact as SOTA254 baseline — zero risk.
More TTT adaptation (3→8 epochs) and finer sliding window (64→32 stride).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TTT_SAM=1 enables SAM during test-time training. Two forward+backward
passes per step: first computes gradient, perturbs weights by rho in
gradient direction, then recomputes gradient at the perturbed point.
Uses the perturbed gradient to update original weights, seeking flatter
minima that generalize better.

Motivated by TTT consistently overfitting: loss goes down but eval
gets worse across all runs. SAM directly targets this failure mode.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Exact settings from README. If this doesn't reproduce, the FA3 build
is the variable, not the code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Same training as the 1.1303 baseline, only change is TTT_SAM=1.
SAM seeks flatter minima during test-time training to fix the
TTT overfitting pattern (loss down, eval up).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TTT 8 epochs + stride 32. Stride made no difference — all gain from
extra TTT adaptation. Same model/artifact, eval-only change.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both seeds beat baseline. TTT 8 epochs is a free win.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Seed 7 compresses worse than 1337/42. BPB improved but artifact
exceeds 16 MB cap. Need passing 3rd seed for submission.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ROPE_DIMS=16: apply rotary to 25% of head dims, rest position-free
LN_SCALE=1: scale RMSNorm output by 1/sqrt(layer+1)
Both env-var gated, default off — existing runs unaffected.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Octavian and others added 23 commits March 23, 2026 01:49
Star-ReLU(x)² replaces SiLU in SwiGLU (GEPA non-TTT SOTA uses this).
Value Residual Learning: per-block learned lambda mixes current hidden
state with first-block output for attention values (-0.015 BPB in PR#413).
MLP trimmed 4.5->4.0 to fit 16MB budget (~24.9M params, ~13.8MB est).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SwiGLU fork (PR #462 base) + GPTQ + OptRot + AdamW TTT = 1.0763 BPB
but artifact is 19.6MB (over 16MB limit). OptRot Hadamard rotation
hurts zstd compression. Next step: solve the size problem.

v7 GPTQ stack submitted as PR #508: 3-seed mean 1.1215 BPB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Matches the 1.0763 run's settings via fractal compression:
- Full MHA 8/8 (was GQA 8/4)
- seq_len=1024 (2x more steps: ~8000 vs ~4300)
- batch=524288 (even more steps)
- MLP 3.5x SwiGLU Star-ReLU (fits 16MB with full MHA)
- EMA decay 0.9985 (GEPA's setting)
- warmdown 6000 (70% of expected steps)
- bigram 8192, XSA last 3, VE on blocks 4,5
- 6 unique x 2 loops = 12 effective depth
- ~24.6M params, est ~14.5MB artifact

The 1.0763 model compressed into the 16MB box via fractal weight sharing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three configs to map the compression cost curve on H100:
- v4 (6x2): 12 eff depth, 24.6M, ~13.3MB
- v4a (5x2): 10 eff depth, 20.8M, ~11.3MB
- v4b (4x2): 8 eff depth, 17.0M, ~9.2MB

Spark data says 5x2 is optimal (beats flat by 0.011 BPB).
H100 results will give the real calibration number.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
run_v7_short_ttt.sh: self-contained test script for Option A.
SGD lr=0.002, 3 epochs, freeze 2 blocks, no EMA smoothing,
stop training at chunk 50. Captures chunk-51 peak (1.1106 observed)
without EMA dilution that killed TTT gains in PR #508.

train_gpt_v7.py: add TTT_WARMUP_CHUNKS env var for optional
LR warmup during TTT (default 0 = no change to existing behavior).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR #503 proves XSA on all 11 layers helps (1.1195 vs our 1.1215
with XSA on 4). Combined with short TTT no-EMA to target 1st place.

Just an env var change (XSA_LAST_N=11), no code modifications.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Single-seed result on 8xH100 SXM, 600s training + 192s eval:
  Seed 1337: sliding_window=1.12070, legal_ttt=1.12075
  Artifact: 15.60 MB (int6+zstd-22)

Pre-TTT sliding window is the effective score — short TTT
(SGD, 50 chunks, no EMA) was net neutral (+0.00005).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Surgical weight sharing: SHARE_START=4, SHARE_LOOPS=3 replaces
3 middle layers with 1 shared block looped 3x. NUM_LAYERS=9 gives
9 stored blocks, 11 effective depth. Saves 2 blocks (~4.4MB).
19.6MB -> ~15.2MB = fits 16MB budget.

Orthogonal loop positions on the shared block. Rest of architecture
unchanged: Star-ReLU SwiGLU, full MHA 8/8, GPTQ, TTT, EMA.

The Frugendorff as a compression tool on an existing SOTA model.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ITERATIONS=7500 matches actual wallclock steps so cosine warmdown
completes (LR→0) instead of stopping at ~45% peak. Combined with
XSA on all 11 layers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
XSA-11 model was 24KB over 16MB limit. Trimming bigram hash table
saves ~32K raw params to fit within budget.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
On top of the SwiGLU compression shim:
- LeakyReLU(0.5)² replaces ReLU² (preserves negative gradients)
- VRL per-block lambda mixing first-block output (-0.015 BPB in PR#413)
- decoder_lr_mult=2.0 already present from base

Test A (clean): train_gpt_swiglu_frugendorff.py = pure compression cost
Test B (stacked): train_gpt_swiglu_frugendorff_stacked.py = compression + extras

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4 tests in 2 batches for 2xGPU research:
  A: Bigram 1536 + XSA-11 (size fit)
  B: Bigram 1024 + XSA-11 (aggressive size)
  C: GPTQ percdamp=0.05 (conservative error compensation)
  D: GPTQ block_size=64 (less error accumulation)

Wire GPTQ_BLOCK_SIZE and GPTQ_PERCDAMP as env vars.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Removed ttt_adapt function, all TTT hyperparameters, and TTT call
sites from both clean and stacked versions. TTT trains on validation
tokens before scoring — illegal per issue #402.

All remaining features are pure training/architecture/quantization
techniques: Star-ReLU, SwiGLU, GPTQ, EMA, U-Net, BigramHash,
Frugendorff compression, VRL, LeakyReLU, decoder_lr_mult.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Systematic search for optimal weight sharing config:
  Batch 1: Size frontier (loop4, earlier sharing, fewer layers)
  Batch 2: Quality frontier (depth vs sharing tradeoffs)
  Batch 3: Compression levers (bigram, MLP tuning)

Baseline: 11L/SHARE4/LOOPS3 = 1.0900 BPB, 16.68MB (over by 680KB)
Target: fit ≤16MB while keeping BPB near 1.09

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Illegal TTT trains on validation tokens before scoring them,
violating issue #402. Disabled TTT_ENABLED default to "0" in:
- train_gpt_swiglu.py
- train_gpt_frugendorff_v3.py
- train_gpt_frugendorff_v4.py, v4a, v4b
- train_gpt_v7.py, v7_short_ttt.py
Removed eval_ttt.py (standalone illegal TTT eval).

Legal techniques preserved:
- TTT burst (training data replay) in v1/v4/v5/v6/squared
- Inner-TTT in fractal h100 scripts (our own implementation)
- All training, EMA, GPTQ, sliding window eval

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The ttt_adapt() function trained on ALL validation data for N epochs
BEFORE scoring — a direct violation of issue #402 (score-first rule).

Removed:
- ttt_adapt() function (bulk val-data training)
- TTT hyperparameters (ttt_lr, ttt_epochs, etc.)
- TTT invocation in main()
- ttt_enabled forced to False

The legal alternative is eval_val_sliding_ttt() in train_gpt_v7.py
which scores each chunk before training on it.

Audit status:
- train_gpt_swiglu.py: FIXED (this commit)
- train_gpt_swiglu_frugendorff.py: CLEAN (no TTT)
- train_gpt_swiglu_frugendorff_stacked.py: CLEAN (no TTT)
- train_gpt_v7.py: LEGAL (score-first sliding window)
- Old exp_*/sota*/pr3* dirs: contain legacy illegal TTT but are
  historical experiments, not submission scripts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Core question: how many Frugendorff loops maximize GPTQ
quality per compressed byte? Each loop saves ~2.9M params
(~1.7MB compressed) but reuses the same weights.

Tests: loops 3/4/5, share position 3 vs 4, bigram + MLP levers.
All on train_gpt_swiglu_frugendorff.py (clean, no TTT).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Upgrades train_gpt_swiglu.py with every proven technique for max quality:
- seq_len 1024→2048, batch 524K→786K (PR #505: -0.009 BPB)
- LeakyReLU(0.5)² replaces ReLU² (preserves negative gradient flow)
- VRL: sigmoid-gated first-block mixing into attention input
- Legal score-first TTT ported from v7 (disabled by default)
- int8 GPTQ for attn.proj (lower quant tax on sensitive layers)
- grad_clip 0→0.3, EMA 0.9985→0.997, warmdown 6000→3500
- All illegal TTT remains purged. Score-first only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Block 0's vrl_lambda never receives gradient (v_first is None for first
block). DDP requires find_unused_parameters=True to handle this.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Re-quantizes existing final_model.pt with 8 different GPTQ configs
(percdamp 0.002-0.05, block_size 64-256). Zero training cost.
Tests if different GPTQ settings compress better on Frugendorff weights.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Our best legal SOTA. Script + README + reproduce instructions.
Three copies because we are never losing this again.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Original train_gpt_swiglu.py restored to pre-modification state.
All F1 changes (VRL, LeakyReLU, seq2048, legal TTT, int8) live in
train_gpt_swiglu_f1.py. Never overwrite a working baseline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR#505 base + VRL + LeakyReLU(0.5)² + int8 attn.proj + seq2048.
4521 steps @ 132.7ms, post-GPTQ sliding 1.1208. Beats current SOTA
(1.1215) on quality alone — over 16MB budget, awaiting Frugendorff
compression calibration. No TTT.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant