Record: 11L + EMA + Tight SWA + QAT0.15 + VE128 + Partial RoPE + LN Scale (val_bpb: 1.1243) by newjordan · Pull Request #401 · openai/parameter-golf

newjordan · 2026-03-22T05:15:58Z

Record: 11L + EMA(0.997) + Tight SWA + Late QAT(0.15) + VE128 + Partial RoPE + LN Scale (val_bpb: 1.1243)

Key Innovation: EMA + Tight SWA Stacking + Earlier Late QAT

Three improvements on the PR #374 architecture:

EMA (decay=0.997) stacked with Tight SWA — SWA collects from EMA-averaged weights, giving two orthogonal averaging signals
Earlier Late QAT (scale<0.15 vs 0.1) — 50% more steps under int6 fake-quantization, shrinks quant gap from 0.008 to 0.007
Longer warmdown (3500 vs 3000 iters) — extends convergence tail

Architecture

11 transformer layers, 512-dim, 8 heads (4 KV heads, GQA)
3x MLP expansion with relu-squared activation
Partial RoPE (16/64 dims) + NTK-aware scaling
LN Scale Factor 1/sqrt(layer_idx+1)
U-Net skip connections (5 encoder, 6 decoder)
SmearGate + BigramHash (2048 buckets, dim=128)
Shared Value Embedding (dim=128, layers 9,10)
FlashAttention 3 (Hopper)
Logit softcap 30.0, tied embeddings

Training

Muon optimizer (matrices): lr=0.025, momentum=0.99 (warmup 0.92->0.99 over 1500 steps), WD=0.04
AdamW (embeddings): lr=0.035, (scalars): lr=0.025, WD=0.04
Gradient clip: 0.3
Batch: 786,432 tokens/step, seq_len=2048
Warmdown: 3500 iters (wallclock-based)
EMA: decay=0.997, float32 accumulation, SWA collects from EMA weights
Tight SWA: every 50 steps when scale<0.2 (14 checkpoints from EMA)
Late QAT: STE int6 fake-quantization when LR scale<0.15

Quantization

Int6 per-row for MLP + attention weights
Int8 per-row for embeddings
Control tensors in fp32
zstd level 22 compression

Results (3 seeds, 8xH100 SXM)

Seed	Steps	Sliding BPB (s64)	Post-avg BPB	Quant gap	Artifact
1337	7005	1.1243	1.1412	0.0069	15.88 MB
42	7001	1.1247	1.1417	0.0067	16.06 MB
7	7007	1.1255	1.1424	0.0068	15.68 MB

Best: 1.1243 | Mean: 1.1248 | Std: 0.0006

vs PR #374 (previous non-TTT record)

Metric	PR374	Ours	Delta
Sliding BPB	1.1246	1.1243	-0.0003
Quant gap	0.0080	0.0068	-0.0012
Artifact	15.71 MB	15.88 MB	+0.17 MB

Run

SEED=1337 NUM_LAYERS=11 MLP_MULT=3.0 XSA_LAST_N=4 ROPE_DIMS=16 LN_SCALE=1 \
SWA_ENABLED=1 SWA_EVERY=50 LATE_QAT_THRESHOLD=0.15 WARMDOWN_ITERS=3500 \
VE_ENABLED=1 VE_DIM=128 VE_LAYERS=9,10 \
EMA_ENABLED=1 EMA_DECAY=0.997 \
BIGRAM_VOCAB_SIZE=2048 BIGRAM_DIM=128 ADAM_WD=0.04 MUON_WD=0.04 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
torchrun --nproc_per_node=8 train_gpt.py

…AttnRes

… gravity needs more steps

@timowhite88

11L Int6 MLP3x + SmearGate + BigramHash + OrthoInit + TTT SGD 3ep Exact reproduction of @timowhite88's FarnsworthEngine recipe. No modifications — run as-is to validate baseline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

#1 untried combination from competition commentary: TTT (from #254) + XSA (from #265) = estimated 1.117-1.121 BPB XSA_LAST_N=3 excludes self-attention in final 3 layers. Zero extra params, frees attention capacity for cross-token focus. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

8 Q heads with 4 KV heads needs repeat_interleave before matmul. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

exp_a: Multi-Token Prediction (MTP_NUM_HEADS=2, excluded from export) exp_b: SwiGLU MLP replacing ReLU² (hidden=1024, same param count) exp_c: Vocab 1536 tokenizer for better bytes-per-token ratio All based on PR #254 SOTA clone (1.1303 BPB). Priority: exp_c first. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

TTT v2 (cosine LR decay, discriminative per-layer LR, low momentum 0.3, WD), seq-length curriculum (256→2048), batch warmup (262K→786K), D2Z LR schedule, XSA last 3, temperature scaling, optional Mousse optimizer. Two run scripts: full stack (run_v2.sh) and conservative TTT-only (run_v2_ttt_only.sh). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

flash_attn_interface (FA3 Hopper) → flash_attn (FA2) → torch SDPA. Script never crashes on missing flash-attn. Run scripts attempt pip install on startup if FA3 not found. Applied to both sota254 and sota_v2. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…+ untested v2 Restores all four files to their state at 83efa9c. The FA3→FA2→SDPA fallback was added in response to an environment question and should not have touched application code. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

torch.compile can promote tensors to fp32 which hits missing FA3 kernels (disabled at build time). Explicit bf16 cast prevents silent NaN output. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

A (MTP): 1.1619 BPB roundtrip — worse than baseline B (SwiGLU): 1.1348 BPB sliding — close but +0.0045 vs baseline Both artifacts over 16MB due to missing zstandard (zlib fallback) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…combine The self-exclusion mask + causal mask leaves position 0 with all -inf, producing NaN from softmax. Fix: don't self-exclude position 0 since it has no other causal targets to attend to. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

XSA_LAST_N=3 was costing ~25% step time due to manual matmul path. Set to 0 to isolate TTT v2 + temp scaling gains at full speed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

XSA manual attention killed step speed, only 4771/9000 steps completed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…seline Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

exp_a MTP: 1.1619, exp_b SwiGLU: 1.1570, exp_c: missing tokenizer data. TTT v1 hurt in both exp_a and exp_b (same pattern as TTT v2). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Same model/artifact as SOTA254 baseline — zero risk. More TTT adaptation (3→8 epochs) and finer sliding window (64→32 stride). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

TTT_SAM=1 enables SAM during test-time training. Two forward+backward passes per step: first computes gradient, perturbs weights by rho in gradient direction, then recomputes gradient at the perturbed point. Uses the perturbed gradient to update original weights, seeking flatter minima that generalize better. Motivated by TTT consistently overfitting: loss goes down but eval gets worse across all runs. SAM directly targets this failure mode. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Exact settings from README. If this doesn't reproduce, the FA3 build is the variable, not the code. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Same training as the 1.1303 baseline, only change is TTT_SAM=1. SAM seeks flatter minima during test-time training to fix the TTT overfitting pattern (loss down, eval up). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

TTT 8 epochs + stride 32. Stride made no difference — all gain from extra TTT adaptation. Same model/artifact, eval-only change. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Both seeds beat baseline. TTT 8 epochs is a free win. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Seed 7 compresses worse than 1337/42. BPB improved but artifact exceeds 16 MB cap. Need passing 3rd seed for submission. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ROPE_DIMS=16: apply rotary to 25% of head dims, rest position-free LN_SCALE=1: scale RMSNorm output by 1/sqrt(layer+1) Both env-var gated, default off — existing runs unaffected. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Star-ReLU(x)² replaces SiLU in SwiGLU (GEPA non-TTT SOTA uses this). Value Residual Learning: per-block learned lambda mixes current hidden state with first-block output for attention values (-0.015 BPB in PR#413). MLP trimmed 4.5->4.0 to fit 16MB budget (~24.9M params, ~13.8MB est). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

SwiGLU fork (PR #462 base) + GPTQ + OptRot + AdamW TTT = 1.0763 BPB but artifact is 19.6MB (over 16MB limit). OptRot Hadamard rotation hurts zstd compression. Next step: solve the size problem. v7 GPTQ stack submitted as PR #508: 3-seed mean 1.1215 BPB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Matches the 1.0763 run's settings via fractal compression: - Full MHA 8/8 (was GQA 8/4) - seq_len=1024 (2x more steps: ~8000 vs ~4300) - batch=524288 (even more steps) - MLP 3.5x SwiGLU Star-ReLU (fits 16MB with full MHA) - EMA decay 0.9985 (GEPA's setting) - warmdown 6000 (70% of expected steps) - bigram 8192, XSA last 3, VE on blocks 4,5 - 6 unique x 2 loops = 12 effective depth - ~24.6M params, est ~14.5MB artifact The 1.0763 model compressed into the 16MB box via fractal weight sharing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Three configs to map the compression cost curve on H100: - v4 (6x2): 12 eff depth, 24.6M, ~13.3MB - v4a (5x2): 10 eff depth, 20.8M, ~11.3MB - v4b (4x2): 8 eff depth, 17.0M, ~9.2MB Spark data says 5x2 is optimal (beats flat by 0.011 BPB). H100 results will give the real calibration number. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

run_v7_short_ttt.sh: self-contained test script for Option A. SGD lr=0.002, 3 epochs, freeze 2 blocks, no EMA smoothing, stop training at chunk 50. Captures chunk-51 peak (1.1106 observed) without EMA dilution that killed TTT gains in PR #508. train_gpt_v7.py: add TTT_WARMUP_CHUNKS env var for optional LR warmup during TTT (default 0 = no change to existing behavior). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

PR #503 proves XSA on all 11 layers helps (1.1195 vs our 1.1215 with XSA on 4). Combined with short TTT no-EMA to target 1st place. Just an env var change (XSA_LAST_N=11), no code modifications. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Single-seed result on 8xH100 SXM, 600s training + 192s eval: Seed 1337: sliding_window=1.12070, legal_ttt=1.12075 Artifact: 15.60 MB (int6+zstd-22) Pre-TTT sliding window is the effective score — short TTT (SGD, 50 chunks, no EMA) was net neutral (+0.00005). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Surgical weight sharing: SHARE_START=4, SHARE_LOOPS=3 replaces 3 middle layers with 1 shared block looped 3x. NUM_LAYERS=9 gives 9 stored blocks, 11 effective depth. Saves 2 blocks (~4.4MB). 19.6MB -> ~15.2MB = fits 16MB budget. Orthogonal loop positions on the shared block. Rest of architecture unchanged: Star-ReLU SwiGLU, full MHA 8/8, GPTQ, TTT, EMA. The Frugendorff as a compression tool on an existing SOTA model. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ITERATIONS=7500 matches actual wallclock steps so cosine warmdown completes (LR→0) instead of stopping at ~45% peak. Combined with XSA on all 11 layers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

XSA-11 model was 24KB over 16MB limit. Trimming bigram hash table saves ~32K raw params to fit within budget. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

On top of the SwiGLU compression shim: - LeakyReLU(0.5)² replaces ReLU² (preserves negative gradients) - VRL per-block lambda mixing first-block output (-0.015 BPB in PR#413) - decoder_lr_mult=2.0 already present from base Test A (clean): train_gpt_swiglu_frugendorff.py = pure compression cost Test B (stacked): train_gpt_swiglu_frugendorff_stacked.py = compression + extras Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

4 tests in 2 batches for 2xGPU research: A: Bigram 1536 + XSA-11 (size fit) B: Bigram 1024 + XSA-11 (aggressive size) C: GPTQ percdamp=0.05 (conservative error compensation) D: GPTQ block_size=64 (less error accumulation) Wire GPTQ_BLOCK_SIZE and GPTQ_PERCDAMP as env vars. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Removed ttt_adapt function, all TTT hyperparameters, and TTT call sites from both clean and stacked versions. TTT trains on validation tokens before scoring — illegal per issue #402. All remaining features are pure training/architecture/quantization techniques: Star-ReLU, SwiGLU, GPTQ, EMA, U-Net, BigramHash, Frugendorff compression, VRL, LeakyReLU, decoder_lr_mult. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Systematic search for optimal weight sharing config: Batch 1: Size frontier (loop4, earlier sharing, fewer layers) Batch 2: Quality frontier (depth vs sharing tradeoffs) Batch 3: Compression levers (bigram, MLP tuning) Baseline: 11L/SHARE4/LOOPS3 = 1.0900 BPB, 16.68MB (over by 680KB) Target: fit ≤16MB while keeping BPB near 1.09 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Illegal TTT trains on validation tokens before scoring them, violating issue #402. Disabled TTT_ENABLED default to "0" in: - train_gpt_swiglu.py - train_gpt_frugendorff_v3.py - train_gpt_frugendorff_v4.py, v4a, v4b - train_gpt_v7.py, v7_short_ttt.py Removed eval_ttt.py (standalone illegal TTT eval). Legal techniques preserved: - TTT burst (training data replay) in v1/v4/v5/v6/squared - Inner-TTT in fractal h100 scripts (our own implementation) - All training, EMA, GPTQ, sliding window eval Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The ttt_adapt() function trained on ALL validation data for N epochs BEFORE scoring — a direct violation of issue #402 (score-first rule). Removed: - ttt_adapt() function (bulk val-data training) - TTT hyperparameters (ttt_lr, ttt_epochs, etc.) - TTT invocation in main() - ttt_enabled forced to False The legal alternative is eval_val_sliding_ttt() in train_gpt_v7.py which scores each chunk before training on it. Audit status: - train_gpt_swiglu.py: FIXED (this commit) - train_gpt_swiglu_frugendorff.py: CLEAN (no TTT) - train_gpt_swiglu_frugendorff_stacked.py: CLEAN (no TTT) - train_gpt_v7.py: LEGAL (score-first sliding window) - Old exp_*/sota*/pr3* dirs: contain legacy illegal TTT but are historical experiments, not submission scripts Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Core question: how many Frugendorff loops maximize GPTQ quality per compressed byte? Each loop saves ~2.9M params (~1.7MB compressed) but reuses the same weights. Tests: loops 3/4/5, share position 3 vs 4, bigram + MLP levers. All on train_gpt_swiglu_frugendorff.py (clean, no TTT). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Upgrades train_gpt_swiglu.py with every proven technique for max quality: - seq_len 1024→2048, batch 524K→786K (PR #505: -0.009 BPB) - LeakyReLU(0.5)² replaces ReLU² (preserves negative gradient flow) - VRL: sigmoid-gated first-block mixing into attention input - Legal score-first TTT ported from v7 (disabled by default) - int8 GPTQ for attn.proj (lower quant tax on sensitive layers) - grad_clip 0→0.3, EMA 0.9985→0.997, warmdown 6000→3500 - All illegal TTT remains purged. Score-first only. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Block 0's vrl_lambda never receives gradient (v_first is None for first block). DDP requires find_unused_parameters=True to handle this. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Re-quantizes existing final_model.pt with 8 different GPTQ configs (percdamp 0.002-0.05, block_size 64-256). Zero training cost. Tests if different GPTQ settings compress better on Frugendorff weights. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Our best legal SOTA. Script + README + reproduce instructions. Three copies because we are never losing this again. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Original train_gpt_swiglu.py restored to pre-modification state. All F1 changes (VRL, LeakyReLU, seq2048, legal TTT, int8) live in train_gpt_swiglu_f1.py. Never overwrite a working baseline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

PR#505 base + VRL + LeakyReLU(0.5)² + int8 attn.proj + seq2048. 4521 steps @ 132.7ms, post-GPTQ sliding 1.1208. Beats current SOTA (1.1215) on quality alone — over 16MB budget, awaiting Frugendorff compression calibration. No TTT. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Octavian and others added 30 commits March 18, 2026 18:06

docs: fractal transformer research plan — weight sharing + gravity + …

6e503d9

…AttnRes

results: first local ladder — fractal 3x3 beats baseline by 7.1% BPB,…

73271f3

… gravity needs more steps

Fix XSA GQA broadcast bug — expand KV heads before manual attention

4e4cc7f

8 Q heads with 4 KV heads needs repeat_interleave before matmul. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix FA3 NaN: cast qkv to bf16 before FA3 call, disable dynamo DDP opt

7171b6a

torch.compile can promote tensors to fp32 which hits missing FA3 kernels (disabled at build time). Explicit bf16 cast prevents silent NaN output. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add 2-seed validation scripts for exp A/B/C

c0adf16

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Disable XSA in ttt_only run — manual attention too slow vs FA3

0b2c73c

XSA_LAST_N=3 was costing ~25% step time due to manual matmul path. Set to 0 to isolate TTT v2 + temp scaling gains at full speed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add run_v2_ttt_noXSA.sh — TTT v2 + temp scaling, all FA3, max speed

2d79228

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Restore XSA_LAST_N=3 in run_v2_ttt_only.sh (keep existing test intact)

508cdf1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Log v2 TTT-only + XSA=3 result: 1.1982 BPB (worse than 1.1301 baseline)

c1e74ba

XSA manual attention killed step speed, only 4771/9000 steps completed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Strip verbose logging from v2 train loop — match baseline format

f263214

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Log v2 noXSA result: 1.1538/1.1315 BPB — TTT v2 hurt, no edge over ba…

7bdf6de

…seline Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Log exp_a/b/c results: all worse than 1.1301 baseline, exp_c never ran

2620ec3

exp_a MTP: 1.1619, exp_b SwiGLU: 1.1570, exp_c: missing tokenizer data. TTT v1 hurt in both exp_a and exp_b (same pattern as TTT v2). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add exp D: TTT 8 epochs + stride 32 (eval-only improvement)

aea1e39

Same model/artifact as SOTA254 baseline — zero risk. More TTT adaptation (3→8 epochs) and finer sliding window (64→32 stride). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add baseline reproduction script — verify 1.1303 on current FA3 build

4fb1bec

Exact settings from README. If this doesn't reproduce, the FA3 build is the variable, not the code. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Log exp D result: 1.1295 BPB — new best (-0.0008 vs baseline)

9d86a37

TTT 8 epochs + stride 32. Stride made no difference — all gain from extra TTT adaptation. Same model/artifact, eval-only change. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Log exp D seed 42: 1.1307 BPB — confirms improvement (mean 1.1301)

79c9c2a

Both seeds beat baseline. TTT 8 epochs is a free win. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add exp_d SAM variant — TTT 8ep + stride 32 + sharpness-aware TTT

87c2831

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Log exp D seed 7: 1.1313 BPB but 16.18 MB — over size limit

e24283a

Seed 7 compresses worse than 1337/42. BPB improved but artifact exceeds 16 MB cap. Need passing 3rd seed for submission. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add exp_d/run_sam_clean.sh — pure SAM A/B test, no other changes

753ebd1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Octavian and others added 23 commits March 23, 2026 01:49

Add v7 smooth run: proper warmdown + XSA-all

47b7700

ITERATIONS=7500 matches actual wallclock steps so cosine warmdown completes (LR→0) instead of stopping at ~45% peak. Combined with XSA on all 11 layers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Reduce bigram buckets 2048→1792 for XSA-11 size fit

da0f71a

XSA-11 model was 24KB over 16MB limit. Trimming bigram hash table saves ~32K raw params to fit within budget. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix DDP: find_unused_parameters=True for VRL block 0

77a8a18

Block 0's vrl_lambda never receives gradient (v_first is None for first block). DDP requires find_unused_parameters=True to handle this. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Save Gold Standard (GS): v7 GPTQ 1.1206 BPB — 3 copies

3e62406

Our best legal SOTA. Script + README + reproduce instructions. Three copies because we are never losing this again. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

newjordan closed this Mar 23, 2026

This was referenced Mar 25, 2026

Podracing: 1.0461 BPB (3-seed mean) #674

Closed

Podracing: 1.0461 BPB (3-seed mean) — 5-gram eval + LeakyReLU² #706

Open

abaybektursun mentioned this pull request Mar 25, 2026

Record: AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112 — val_bpb 1.11473 (3-seed mean) #728

Closed

newjordan mentioned this pull request Mar 25, 2026

Podracing II: Electric Bugaloo — 0.9625 BPB (3-seed mean, all sub-0.964) #753

Open

abaybektursun mentioned this pull request Mar 28, 2026

Record: AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112 — val_bpb 1.11473 (3-seed mean) #1019

Merged

Gusanidas mentioned this pull request Mar 30, 2026

Record: 1.1140 BPB — ResidLambdas + Split-LR + Train-Budget GPTQ + Coprime Loader (12-seed mean) #1130

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record: 11L + EMA + Tight SWA + QAT0.15 + VE128 + Partial RoPE + LN Scale (val_bpb: 1.1243)#401

Record: 11L + EMA + Tight SWA + QAT0.15 + VE128 + Partial RoPE + LN Scale (val_bpb: 1.1243)#401
newjordan wants to merge 134 commits intoopenai:mainfrom
newjordan:experiments/pr374-edge

newjordan commented Mar 22, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

newjordan commented Mar 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Record: 11L + EMA(0.997) + Tight SWA + Late QAT(0.15) + VE128 + Partial RoPE + LN Scale (val_bpb: 1.1243)

Key Innovation: EMA + Tight SWA Stacking + Earlier Late QAT

Architecture

Training

Quantization

Results (3 seeds, 8xH100 SXM)

vs PR #374 (previous non-TTT record)

Run

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

newjordan commented Mar 22, 2026 •

edited

Loading