Non-record: 11L MLP3.5x LeakyReLU(0.5)^2 + Full SOTA Stack (mean val_bpb=1.1330, 8xH100) by aryanbhosale · Pull Request #344 · openai/parameter-golf

aryanbhosale · 2026-03-21T14:57:21Z

Record: 11L MLP3.5x LeakyReLU(0.5)^2 + Full SOTA Stack

3-seed mean val_bpb: 1.1330 (std=0.0007) on 8xH100 SXM

Seed	val_bpb	val_loss	Steps
1337	1.1334	1.9136	3842
42	1.1322	1.9116	3885
2024	1.1334	1.9136	3857
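
The reported mean and std can be re-derived from the three per-seed values in the table. The sketch below assumes the std is the sample standard deviation (n - 1 denominator); the variable names are illustrative and not taken from the submission.

```python
import statistics

# val_bpb per seed, copied from the table above.
val_bpb = {1337: 1.1334, 42: 1.1322, 2024: 1.1334}
values = list(val_bpb.values())

mean_bpb = statistics.mean(values)
# Sample standard deviation (n - 1 in the denominator); an assumption
# about how the reported std was computed.
sample_std = statistics.stdev(values)
# Population standard deviation (n in the denominator), for comparison.
population_std = statistics.pstdev(values)

print(f"mean={mean_bpb:.4f} sample_std={sample_std:.4f} population_std={population_std:.4f}")
```

With these inputs the sample std rounds to 0.0007, matching the reported figure, while the population std rounds to 0.0006.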

Architecture (31.4M params)

11L, 512d, 8H/4KV, MLP 3.5x (hidden=1792) with LeakyReLU(0.5)^2
SmearGate + BigramHash(10240) + TrigramHash(4096) + OrthoInit
Value Residual (ResFormer) + Gated Attention + XSA all 11 layers
Partial RoPE (16/64 dims), U-Net skip connections
Tied FP16 embeddings, logit softcap 30.0
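
As a rough illustration of two items in the list above, here is a scalar sketch of the activation and the logit softcap. It assumes that "LeakyReLU(0.5)^2" means a LeakyReLU with negative slope 0.5 followed by squaring, and that the softcap is the common `cap * tanh(x / cap)` form. Neither is confirmed by the PR text, and the function names are hypothetical.

```python
import math

MODEL_DIM = 512
MLP_MULT = 3.5
# Hidden width implied by "MLP 3.5x (hidden=1792)".
MLP_HIDDEN = int(MLP_MULT * MODEL_DIM)

def leaky_relu_sq(x: float, negative_slope: float = 0.5) -> float:
    """LeakyReLU with the given negative slope, then squared (assumed reading)."""
    y = x if x >= 0.0 else negative_slope * x
    return y * y

def softcap(logit: float, cap: float = 30.0) -> float:
    """Tanh soft-capping: bounds logits to [-cap, cap] (assumed form)."""
    return cap * math.tanh(logit / cap)

print(MLP_HIDDEN, leaky_relu_sq(2.0), leaky_relu_sq(-2.0), softcap(1000.0))
```

Under this reading the squared output is non-negative for both signs of the input, and negative inputs contribute a quarter of what the same magnitude would on the positive side.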

Training

Muon lr=0.03, WD=0.04, momentum 0.92->0.99/1500 steps
EMA (decay=0.997), batch 786K, seq_len 2048
Warmdown 3500, Late QAT via STE (final 15%), gradient clip 0.3
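
The momentum warmup and EMA settings above can be sketched as follows. The linear interpolation shape is an assumption (the PR only gives the endpoints and the 1500-step horizon), and `muon_momentum` and `ema_update` are hypothetical helper names, not functions from the submission.

```python
def muon_momentum(step: int, start: float = 0.92, end: float = 0.99,
                  warmup_steps: int = 1500) -> float:
    """Momentum ramped from `start` to `end` over `warmup_steps` (linear ramp assumed)."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)

def ema_update(ema: list, params: list, decay: float = 0.997) -> list:
    """One EMA step: ema <- decay * ema + (1 - decay) * params, elementwise."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]

# Example: momentum at a few steps, and one EMA step on a toy parameter vector.
for step in (0, 750, 1500, 3000):
    print(step, round(muon_momentum(step), 4))
print(ema_update([1.0, 0.0], [0.0, 1.0]))
```

With decay=0.997 the EMA averages over roughly 1 / (1 - 0.997) ≈ 333 recent steps.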

Quantization

Int6 uniform + GPTQ-lite (5-percentile clip search) + zstd-22
FP16 passthrough for tied embeddings
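
A minimal sketch of symmetric int6 quantization with a clip search, under stated assumptions: "5-percentile clip search" is read here as trying five candidate clipping percentiles of |w| and keeping the one with the lowest reconstruction error. The percentile values, the [-31, 31] integer range, and all function names are illustrative. The GPTQ-style error compensation and the zstd step are not shown.

```python
import math

def quantize_int6(weights, clip):
    """Symmetric uniform int6: map [-clip, clip] onto integers in [-31, 31]."""
    scale = clip / 31.0
    q = [max(-31, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def percentile_abs(weights, pct):
    """Nearest-rank percentile of |w| for pct in (0, 100]."""
    mags = sorted(abs(w) for w in weights)
    rank = math.ceil(pct / 100.0 * len(mags))
    return mags[max(0, min(len(mags) - 1, rank - 1))]

def clip_search(weights, percentiles=(99.0, 99.5, 99.9, 99.99, 100.0)):
    """Return (error, percentile, codes, scale) for the best candidate, or None."""
    best = None
    for pct in percentiles:
        clip = percentile_abs(weights, pct)
        if clip == 0.0:
            continue  # all-zero slice: nothing to quantize at this percentile
        q, scale = quantize_int6(weights, clip)
        err = mse(weights, dequantize(q, scale))
        if best is None or err < best[0]:
            best = (err, pct, q, scale)
    return best

# Toy row: a dense band of small weights plus one outlier.
row = [i / 1000.0 for i in range(-500, 501)] + [5.0]
err, pct, codes, scale = clip_search(row)
print(f"best percentile={pct} scale={scale:.5f} mse={err:.6f}")
```

The search trades clipping error on outliers against rounding error on the bulk of the weights, which is why the best percentile depends on the weight distribution.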

Evaluation

Sliding window stride=64
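
One common way to implement sliding-window evaluation is to score only the newest `stride` tokens of each window, so that every scored token (after the first window) sees close to a full context. The sketch below generates those index spans. It assumes the window length equals the training seq_len of 2048, which the PR does not state for evaluation, and `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(n_tokens: int, seq_len: int = 2048, stride: int = 64):
    """Yield (ctx_start, score_start, end) spans.

    The model would be run on tokens [ctx_start, end), and only the
    tokens in [score_start, end) count toward the loss.
    """
    score_start = 0
    while score_start < n_tokens:
        if score_start == 0:
            # First window: no earlier context exists, so score everything in it.
            end = min(seq_len, n_tokens)
        else:
            end = min(score_start + stride, n_tokens)
        ctx_start = max(0, end - seq_len)
        yield ctx_start, score_start, end
        score_start = end

spans = list(sliding_windows(5000))
print(spans[:3], len(spans))
```

A smaller stride gives each scored token more context at the cost of more forward passes: about `n_tokens / stride` windows instead of `n_tokens / seq_len`.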

Development

30-experiment autoresearch loop on 1xH100 (~8 hours). Validated on 8xH100 SXM (RunPod).

Feature ablation (1xH100):

Value Residual: -0.017 BPB
SmearGate: -0.010
XSA all 11: -0.005
Gated Attention: -0.004
Partial RoPE: -0.004
TrigramHash: -0.002
Late QAT: -0.002

Built on SOTA (10L, int5/int6, BigramHash, SmearGate, SWA) with 75+ automated experiments across Mac MLX and 1xH100 CUDA.

Key findings:
- NUM_HEADS=4 with head_dim=128: -0.095 BPB relative improvement
- Step-based LR schedule: -0.483 BPB vs wallclock-based
- BigramHash(16384): -0.025 BPB vs 10240
- MATRIX_LR=0.03: -0.003 BPB

Tested on 1xH100 (800 steps, 600s). Post-quant val_bpb: 1.2756 with sliding window eval stride=256.

Known issue: artifact is 17.4MB (over 16MB) due to head_dim=128 increasing params. Needs int4/int5 MLP compression to fit budget.

…le, EMA, Late QAT, TTT

Major rewrite targeting top-5 leaderboard:
- 11 layers (from 10), BigramHash reduced to 10240 to fit 16MB
- XSA (Exclusive Self-Attention) on last 4 layers
- Partial RoPE: 16/64 head dims get position encoding
- LN Scale: 1/sqrt(layer+1) dampening on deeper layers
- EMA (decay=0.997) replaces SWA
- Late QAT: STE int6 enabled only in final 4% of training
- TTT: 25-epoch SGD on val data post-quantization
- FA3 auto-detection with SDPA fallback
- Reverted SwiGLU back to relu² (confirmed worse by openai#340, openai#344)
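
Two items from the commit message above, partial RoPE and the LN scale, can be sketched on a single head vector. The half-split pairing of rotated dimensions and the frequency base of 10000 are assumptions borrowed from common RoPE implementations, not details given in the PR, and both function names are hypothetical.

```python
import math

def partial_rope(vec, pos, rope_dims=16, base=10000.0):
    """Rotate the first `rope_dims` entries of a head vector by position `pos`.

    The remaining entries pass through unchanged, so they carry no
    positional signal. Pairs are (i, i + rope_dims // 2), an assumed layout.
    """
    out = list(vec)
    half = rope_dims // 2
    for i in range(half):
        theta = pos * base ** (-2.0 * i / rope_dims)
        a, b = vec[i], vec[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out

def ln_scale(layer_idx: int) -> float:
    """Per-layer dampening factor 1/sqrt(layer+1), with layers indexed from 0."""
    return 1.0 / math.sqrt(layer_idx + 1)

head = [float(i + 1) for i in range(64)]  # toy head vector, head_dim = 64
rotated = partial_rope(head, pos=5)
print(rotated[16:20], [round(ln_scale(l), 3) for l in range(4)])
```

Because each pair is rotated, the vector norm is preserved, and only 16 of the 64 dimensions change with position.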

…H100)

- 11L 512d 8H/4KV MLP3x LeakyReLU(0.5)^2
- SmearGate + BigramHash(10240) + TrigramHash(4096)
- Value Residual + Gated Attention + XSA-all-11
- Partial RoPE(16/64) + int6 GPTQ-lite + zstd-22
- Score-first TTT (3 epochs, per-layer LR, cosine schedule)
- Developed via 30-experiment autoresearch on 1xH100
- Artifact: 8.5MB (under 16MB)

…1.1330, 3-seed, 8xH100)

- 31.4M params, 11L 512d 8H/4KV MLP3.5x(1792)
- LeakyReLU(0.5)^2, SmearGate, BigramHash(10240), TrigramHash(4096)
- Value Residual, Gated Attention, XSA-all-11, Partial RoPE(16/64)
- Muon lr=0.03, EMA(0.997), Late QAT, int6 GPTQ-lite + zstd-22
- 3-seed: 1.1334/1.1322/1.1334, mean=1.1330, std=0.0007
- Developed via 30-experiment autoresearch on 1xH100

notapplica mentioned this pull request Mar 21, 2026

⛳ Parameter Golf Live AI Commentary ⛳ + Analysis / Ideas | every 10 minutes #140

Open

aryanbhosale changed the title ~~Non-record: Autoresearch Heads4 + Step-based LR + Sliding Window (1xH100)~~ Non-record: 11L Full SOTA Stack + Score-First TTT (val_bpb=1.1383, 1xH100) Mar 24, 2026

aryanbhosale changed the title ~~Non-record: 11L Full SOTA Stack + Score-First TTT (val_bpb=1.1383, 1xH100)~~ Record: 11L MLP3.5x LeakyReLU(0.5)^2 + Full SOTA Stack (mean val_bpb=1.1330, 8xH100) Mar 24, 2026

aryanbhosale changed the title ~~Record: 11L MLP3.5x LeakyReLU(0.5)^2 + Full SOTA Stack (mean val_bpb=1.1330, 8xH100)~~ Non-record: 11L MLP3.5x LeakyReLU(0.5)^2 + Full SOTA Stack (mean val_bpb=1.1330, 8xH100) Mar 24, 2026

aryanbhosale wants to merge 3 commits into openai:main from aryanbhosale:submission/autoresearch-heads4
