Record: 11L VRL + LeakyReLU² + Full GPTQ (3-seed mean val_bpb=1.1175) #569
gowtham0992 wants to merge 1 commit into openai:main from
Conversation
f6abfae to 1a4f3aa
Stack of all verified improvements on current SOTA (PR openai#414, 1.1228 bpb):
- VRL (Value Residual Learning, arxiv:2410.17897): layer 0 V shared via sigmoid gates
- Full GPTQ (Hessian-aware Cholesky int6): -0.0026 bpb over GPTQ-lite
- LeakyReLU(0.5)²: -0.0015 bpb
- Batched LoRA TTT: rank=8 Q+V+LM head, all 11 layers, 2 epochs cosine LR
- Score-before-train every chunk every epoch (backward-looking, fully legal)
- EMA(0.997) + Tight SWA + Late QAT@0.15 + XSA-all(11) + Partial RoPE(16/64)
- LN Scale + VE128(9,10) + SmearGate + BigramHash(2048) + Prune(2%)

Expected: ~1.08–1.10 bpb (non-record pending 3-seed H100 validation)

Attribution: signalrush (PR openai#414), gowtham0992 (PR openai#569), MatoTeziTanka (PR openai#512), LoquiAuris (PR openai#548)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BigramHash(8192) gives -0.0025 bpb, but the artifact is 16.37MB > 16MB. PR openai#569 (VRL sigmoid gates) launched for repro.
Pre-eval TTT was non-compliant per issue openai#402. Now uses score-first TTT: score each chunk before training on it. Added LeakyReLU(0.5)² replacing relu² (proven by openai#569, openai#535). Score pending rerun with compute credits.
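The score-first ordering described here can be sketched in a few lines (a minimal sketch; `score_fn` and `train_fn` are hypothetical stand-ins for the per-chunk eval and LoRA-update steps):

```python
def score_first_ttt(chunks, score_fn, train_fn):
    # Legal TTT ordering: each chunk is scored with the current weights
    # BEFORE the model trains on it, so no chunk influences its own score.
    total = 0.0
    for chunk in chunks:
        total += score_fn(chunk)  # eval with weights that have not seen this chunk
        train_fn(chunk)           # only then adapt on it
    return total
```

The key invariant is the interleaving: score(k) always precedes train(k), so the cumulative score is backward-looking only.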
Multiple top PRs (openai#535, openai#549, openai#569) demonstrate -0.0015 to -0.003 bpb from this change. LeakyReLU preserves gradient flow through negative pre-activations while maintaining the sparsity/gating benefits of squaring. At 22M params, dead neurons from hard ReLU are expensive. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
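A minimal scalar sketch of the activation being discussed (assuming the straightforward leak-then-square form; the in-repo version would be a vectorized torch expression):

```python
def leaky_relu_sq(x, negative_slope=0.5):
    # Leak first, then square: a negative input maps to a small positive
    # value instead of the hard zero of relu^2, so its gradient path survives.
    y = x if x >= 0.0 else negative_slope * x
    return y * y
```

For example, an input of -2.0 yields 1.0 here where relu² would yield 0.0; that nonzero response on negatives is exactly the dead-neuron difference the comment refers to.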
Saves layer 0's raw V output and blends it into all subsequent layers via learned sigmoid gates (initialized at -1.5 ≈ 18% mixing). PR openai#569 achieves 1.1175 with VRL+LeakyReLU²+Full GPTQ (no TTT). VRL is orthogonal to our existing VE128 (shared value embedding). Enabled by default (VRL_ENABLED=1). Gate adds 1 scalar param per layer (10 params total for 11L). Zero compute overhead beyond the gated blend. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
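The gated blend can be sketched per layer as follows (hypothetical names `vrl_mix`/`v0`; sigmoid(-1.5) ≈ 0.18 gives the ~18% initial mixing mentioned above):

```python
import math

def vrl_mix(v, v0, gate_logit=-1.5):
    # Blend this layer's V with layer 0's saved raw V output through a
    # learned per-layer scalar gate; at init sigmoid(-1.5) ~= 0.18.
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return [(1.0 - g) * a + g * b for a, b in zip(v, v0)]
```

In the real model the gate logit would be a trainable scalar per layer (10 of them for layers 1–10), which is the entire parameter cost quoted.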
openai#569 VRL produces 16.0-16.2MB models (over budget). openai#535+XSA-all reliably fits at 15.6MB with 1.1188 bpb. 3-seed validation starting.
…x LeakyReLU author links
Are you counting the calibration time as part of the training 600s, or the eval 600s? If it's part of training (and you can prove it), then I am more inclined to believe this is legal (I would have to look into it further), but it does not meet the minimum nat-difference, so it would be a non-SOTA valid submission. If it's part of eval time, then this is not valid, as it is accessing training data at eval time.
Good catch — GPTQ calibration currently runs in an untimed gap after the training loop breaks at 600s and before the eval timer starts. It's ~10-15s of forward passes on training data to collect Hessians for weight rounding (no gradient updates). I can see how that's ambiguous under the rules, so I'll update the submission to count calibration time inside the 600s training budget. The bpb impact from fewer training steps should be minimal. Will push an updated version shortly.
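For context, the calibration step in question is pure forward-pass statistics gathering. A pure-Python sketch of the Hessian accumulation (hypothetical names; real GPTQ accumulates per-layer input outer products and then does Cholesky-based, Hessian-aware rounding):

```python
def accumulate_hessian(H, x):
    # GPTQ-style calibration: add the outer product x x^T of one
    # forward-pass activation vector; no gradient ever touches the weights.
    d = len(x)
    for i in range(d):
        for j in range(d):
            H[i][j] += x[i] * x[j]
    return H
```

Repeating this over calibration batches yields the H used to decide rounding order and error compensation at quantization time.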
Also, if you don't mind, I'll close this PR for now -- feel free to send more messages here, but a new submission would have a new submission date, so I'd rather you made a new PR.
- Autoresearch loop (program.md, loop.sh, generate_next.py)
- Modal provider for 8xH100 training with checkpoint save/restore
- Experiment framework with preflight size checks
- eval_ttt.py for TTT evaluation against saved checkpoints
- train_gpt_improved.py: PR openai#569 base (VRL, GPTQ, LeakyReLU², pruning)
- train_gpt_576.py: PR openai#576 base (int5, 33.6M params, score-first TTT)
- train_gpt_sota.py: PR openai#573 base
- train_gpt_mlx_recurrent.py: depth recurrence experiments
- Benchmark scripts for local MLX A/B testing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Architecture (matching frontier openai#569 + improvements):
- 11 layers (from 10), XSA on last 4 layers
- VRL: value residual from layer 1 to all subsequent layers
- Partial RoPE: only first 16/64 dims get rotation
- LeakyReLU(0.5)² activation (replaces relu²)
- LN Scale: 1/sqrt(layer_idx+1) dampens deeper layers

Training:
- EMA decay=0.997 (replaces SWA)
- Late QAT: activates when LR scale < 0.15 during warmdown
- Removed MTP (hurt throughput, no bpb gain)
- Removed LoRA TTT (legal version only 0.001 bpb improvement)

Quantization:
- GPTQ-lite: tries 5 clip percentiles per row, picks best MSE
- Mixed int5 MLP / int6 attention + zstd-22

1425 lines (75 headroom for Mousse + Hyper-Connections)
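One of the cheaper items listed, LN Scale, can be sketched in a few lines (assumed form from the 1/sqrt(layer_idx+1) description; where exactly it multiplies into the residual stream is not specified here):

```python
import math

def ln_scale(h, layer_idx):
    # Damp a layer's output by 1/sqrt(layer_idx + 1): layer 0 passes
    # through unscaled, deeper layers contribute progressively less.
    s = 1.0 / math.sqrt(layer_idx + 1)
    return [s * v for v in h]
```

For instance, layer 3 gets a factor of 1/sqrt(4) = 0.5, halving its contribution relative to layer 0.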
New submission filed as #738 — GPTQ calibration now inside the training budget per the #677 ruling, plus novel eval-time innovations (hidden-state kNN-LM + 5-gram cache with pre-committed confidence gate, no safety gate). 3-seed mean 1.0970 bpb.
Approach A (openai#569 int5, no TTT): 1.1317 — int5 penalty too high at d=512
Approach B (openai#576 d=576 int5 + legal s_0 TTT): 1.1188 — best legal result
Approach C (GEPA int5 + TTT): artifact over 16MB

Key lesson: TTT re-scoring is illegal (PR openai#991 closed for this). Only the s_0 cumulative first-pass score is legal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Submission train_gpt.py with all 32 techniques from the execution plan, each gated by environment variables (disabled by default)
- Optuna-based search framework with validate mode (per-technique smoke test) and search mode (TPE over the joint technique + model size space)
- Ablation infrastructure (ablation.py, shell scripts) for tracking experiments
- PR source files for reference (openai#505, openai#569, openai#576, openai#727, openai#738)
- Execution plan document

Techniques span architecture (activations, HybridNorm, SmearGate, DiffAttn, PoPE, WaveletGPT, VGA, XSA), training (EMA, SWA, QAT, MTP), quantization (variable bit-width, OptRot, GPTQ, pruning, entropy coding), and eval-time (TTT-LoRA, n-gram cache, kNN-LM, TurboQuant KV compression).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
3-seed mean val_bpb: 1.1175 (std 0.0005) | ≤15.94 MB | 8xH100 SXM, 600s | No TTT
Key Techniques
Results
vs SOTA (1.1228): improvement = 0.0053 nats
Reproduction
pip install -r requirements.txt
python3 data/cached_challenge_fineweb.py --variant sp1024
BACKOUT_ENABLED=0 SEED=42 python3 -m torch.distributed.run --standalone --nproc_per_node=8 train_gpt.py