Non-record: 11L MLP3.5x LeakyReLU(0.5)^2 + Full SOTA Stack (mean val_bpb=1.1330, 8xH100)#344

Open
aryanbhosale wants to merge 3 commits into openai:main from aryanbhosale:submission/autoresearch-heads4

aryanbhosale commented Mar 21, 2026

Record: 11L MLP3.5x LeakyReLU(0.5)^2 + Full SOTA Stack

3-seed mean val_bpb: 1.1330 (std=0.0007) on 8xH100 SXM

| Seed | val_bpb | val_loss | Steps |
|------|---------|----------|-------|
| 1337 | 1.1334  | 1.9136   | 3842  |
| 42   | 1.1322  | 1.9116   | 3885  |
| 2024 | 1.1334  | 1.9136   | 3857  |
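
The reported mean and std can be re-derived from the three per-seed val_bpb values above. A minimal check follows; the only assumption is that "std" means the sample (n-1) standard deviation, which is the variant that rounds to 0.0007 here (the population variant rounds to 0.0006).

```python
import statistics

# Per-seed val_bpb, copied from the results above.
val_bpb = {1337: 1.1334, 42: 1.1322, 2024: 1.1334}
values = list(val_bpb.values())

mean_bpb = statistics.mean(values)
# Assumption: the reported std is the sample (n-1) standard deviation.
sample_std = statistics.stdev(values)
population_std = statistics.pstdev(values)

print(f"mean={mean_bpb:.4f} sample_std={sample_std:.4f}")
```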

Architecture (31.4M params)

  • 11L, 512d, 8H/4KV, MLP 3.5x (hidden=1792) with LeakyReLU(0.5)^2
  • SmearGate + BigramHash(10240) + TrigramHash(4096) + OrthoInit
  • Value Residual (ResFormer) + Gated Attention + XSA all 11 layers
  • Partial RoPE (16/64 dims), U-Net skip connections
  • Tied FP16 embeddings, logit softcap 30.0
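
To make two of the items above concrete, here is a scalar sketch of the MLP activation and the logit softcap. Both forms are assumptions, not taken from the PR's code: "LeakyReLU(0.5)^2" is read as squaring the output of a LeakyReLU with negative slope 0.5, and the softcap is assumed to be the common `cap * tanh(x / cap)` form. The function names are hypothetical.

```python
import math

MODEL_DIM = 512
MLP_MULT = 3.5
# MLP hidden width implied by "MLP 3.5x" on a 512-d model.
MLP_HIDDEN = int(MODEL_DIM * MLP_MULT)

def leaky_relu_squared(x: float, negative_slope: float = 0.5) -> float:
    """Assumed reading of LeakyReLU(0.5)^2: apply LeakyReLU, then square."""
    y = x if x >= 0.0 else negative_slope * x
    # Squaring discards the sign, so negative inputs give 0.25 * x**2.
    return y * y

def softcap(logit: float, cap: float = 30.0) -> float:
    """Assumed tanh softcap: bounds the logit to [-cap, cap]."""
    return cap * math.tanh(logit / cap)

print(MLP_HIDDEN, leaky_relu_squared(2.0), leaky_relu_squared(-2.0))
```

Under this reading, negative pre-activations still contribute, at a quarter of the magnitude that plain squaring would give, whereas relu² zeroes them out.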

Training

  • Muon lr=0.03, WD=0.04, momentum 0.92->0.99/1500 steps
  • EMA (decay=0.997), batch 786K, seq_len 2048
  • Warmdown 3500, Late QAT via STE (final 15%), gradient clip 0.3
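
The schedules above can be sketched as small step-indexed helpers. This is illustrative only: the PR text does not say how momentum moves from 0.92 to 0.99 (linear interpolation is assumed), nor whether "final 15%" is measured in steps or wallclock time (steps are assumed). All function names are hypothetical.

```python
def muon_momentum(step: int, start: float = 0.92, end: float = 0.99,
                  warmup_steps: int = 1500) -> float:
    """Assumed linear ramp from `start` to `end` over `warmup_steps`, then flat."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)

def ema_update(ema: list, params: list, decay: float = 0.997) -> list:
    """One exponential-moving-average step over a flat list of parameters."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]

def late_qat_enabled(step: int, total_steps: int,
                     late_fraction: float = 0.15) -> bool:
    """Assumed step-based trigger: fake-quantize only in the final fraction."""
    return step >= (1.0 - late_fraction) * total_steps

# Example: momentum halfway through warmup, and QAT status near the end.
print(muon_momentum(750), late_qat_enabled(3500, 3842))
```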

Quantization

  • Int6 uniform + GPTQ-lite (5-percentile clip search) + zstd-22
  • FP16 passthrough for tied embeddings
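
As a rough illustration of int6 quantization with a clip search, the sketch below tries a handful of clipping thresholds and keeps the one with the lowest reconstruction error. It is not the PR's GPTQ-lite implementation: the symmetric range [-31, 31], the squared-error criterion, the reading of "5-percentile" as five candidate percentiles, and the candidate values themselves are all assumptions.

```python
def quantize_int6(weights, clip):
    """Symmetric uniform quantization to integers in [-31, 31] (assumed range)."""
    scale = clip / 31.0
    q = [max(-31, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

def best_clip(weights, percentiles=(0.999, 0.9995, 0.9999, 0.99999, 1.0)):
    """Pick the clip (a percentile of |w|) minimizing squared error.

    The five percentile values are hypothetical placeholders.
    """
    mags = sorted(abs(w) for w in weights)
    if mags[-1] == 0.0:
        return 1.0  # all-zero tensor: any positive clip works
    best_err, best = None, None
    for p in percentiles:
        clip = mags[int(p * (len(mags) - 1))]
        if clip == 0.0:
            continue
        q, scale = quantize_int6(weights, clip)
        err = sum((w - d) ** 2 for w, d in zip(weights, dequantize(q, scale)))
        if best_err is None or err < best_err:
            best_err, best = err, clip
    return best

weights = [0.1 * i for i in range(-10, 11)] + [50.0]  # toy row with one outlier
clip = best_clip(weights)
codes, scale = quantize_int6(weights, clip)
```

The integer codes would then be packed and compressed (the PR uses zstd at level 22), a step this sketch leaves out.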

Evaluation

  • Sliding window stride=64
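
A sliding-window evaluation with stride 64 can be sketched as follows. This is a common formulation and an assumption about the PR's evaluator: each window spans up to `seq_len` tokens, windows advance by `stride`, and only tokens not yet scored count toward the loss, so every token after the first window sees at least `seq_len - stride` tokens of context. The evaluation window length of 2048 is assumed to match the training seq_len.

```python
def sliding_windows(num_tokens: int, seq_len: int = 2048, stride: int = 64):
    """Return (start, end, score_from) triples.

    The model reads tokens[start:end]; only tokens[score_from:end] are scored.
    """
    windows = []
    scored_upto = 0
    start = 0
    while scored_upto < num_tokens:
        end = min(start + seq_len, num_tokens)
        windows.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return windows

windows = sliding_windows(5000)
print(windows[0], windows[1], len(windows))
```

A smaller stride gives each scored token more context at the cost of more forward passes, roughly `num_tokens / stride` windows in total.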

Development

30-experiment autoresearch loop on 1xH100 (~8 hours). Validated on 8xH100 SXM (RunPod).

Feature ablation (1xH100):

  • Value Residual: -0.017 BPB
  • SmearGate: -0.010
  • XSA all 11: -0.005
  • Gated Attention: -0.004
  • Partial RoPE: -0.004
  • TrigramHash: -0.002
  • Late QAT: -0.002

Built on SOTA (10L, int5/int6, BigramHash, SmearGate, SWA) with 75+
automated experiments across Mac MLX and 1xH100 CUDA.

Key findings:
- NUM_HEADS=4 with head_dim=128: -0.095 BPB relative improvement
- Step-based LR schedule: -0.483 BPB vs wallclock-based
- BigramHash(16384): -0.025 BPB vs 10240
- MATRIX_LR=0.03: -0.003 BPB

Tested on 1xH100 (800 steps, 600s). Post-quant val_bpb: 1.2756
with sliding window eval stride=256.

Known issue: artifact is 17.4MB (over 16MB) due to head_dim=128
increasing params. Needs int4/int5 MLP compression to fit budget.
EthanYangTW added a commit to EthanYangTW/parameter-golf that referenced this pull request Mar 22, 2026
…le, EMA, Late QAT, TTT

Major rewrite targeting top-5 leaderboard:
- 11 layers (from 10), BigramHash reduced to 10240 to fit 16MB
- XSA (Exclusive Self-Attention) on last 4 layers
- Partial RoPE: 16/64 head dims get position encoding
- LN Scale: 1/sqrt(layer+1) dampening on deeper layers
- EMA (decay=0.997) replaces SWA
- Late QAT: STE int6 enabled only in final 4% of training
- TTT: 25-epoch SGD on val data post-quantization
- FA3 auto-detection with SDPA fallback
- Reverted SwiGLU back to relu² (SwiGLU confirmed worse by openai#340, openai#344)
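
Two items in the list above, Partial RoPE and LN Scale, can be illustrated on a single head vector. This is a sketch under stated assumptions rather than the commit's code: the rotary base of 10000, the "first half paired with second half" layout inside the rotated slice, and zero-based layer indexing are all assumed, and the function names are hypothetical.

```python
import math

def partial_rope(vec, position, rope_dims=16, base=10000.0):
    """Rotate the first `rope_dims` entries of a head vector; pass the rest through."""
    out = list(vec)
    half = rope_dims // 2
    for i in range(half):
        theta = position * base ** (-2.0 * i / rope_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out

def ln_scale(layer_index: int) -> float:
    """1/sqrt(layer+1) damping, assuming layers are indexed from 0."""
    return 1.0 / math.sqrt(layer_index + 1)

head = [float(i + 1) for i in range(64)]   # toy 64-d head vector
rotated = partial_rope(head, position=5)
```

Only 16 of the 64 head dimensions carry position information here; the remaining 48 are position-independent.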
…H100)

- 11L 512d 8H/4KV MLP3x LeakyReLU(0.5)^2
- SmearGate + BigramHash(10240) + TrigramHash(4096)
- Value Residual + Gated Attention + XSA-all-11
- Partial RoPE(16/64) + int6 GPTQ-lite + zstd-22
- Score-first TTT (3 epochs, per-layer LR, cosine schedule)
- Developed via 30-experiment autoresearch on 1xH100
- Artifact: 8.5MB (under 16MB)
aryanbhosale changed the title from "Non-record: Autoresearch Heads4 + Step-based LR + Sliding Window (1xH100)" to "Non-record: 11L Full SOTA Stack + Score-First TTT (val_bpb=1.1383, 1xH100)" on Mar 24, 2026
…1.1330, 3-seed, 8xH100)

- 31.4M params, 11L 512d 8H/4KV MLP3.5x(1792)
- LeakyReLU(0.5)^2, SmearGate, BigramHash(10240), TrigramHash(4096)
- Value Residual, Gated Attention, XSA-all-11, Partial RoPE(16/64)
- Muon lr=0.03, EMA(0.997), Late QAT, int6 GPTQ-lite + zstd-22
- 3-seed: 1.1334/1.1322/1.1334, mean=1.1330, std=0.0007
- Developed via 30-experiment autoresearch on 1xH100
aryanbhosale changed the title from "Non-record: 11L Full SOTA Stack + Score-First TTT (val_bpb=1.1383, 1xH100)" to "Record: 11L MLP3.5x LeakyReLU(0.5)^2 + Full SOTA Stack (mean val_bpb=1.1330, 8xH100)" on Mar 24, 2026
aryanbhosale changed the title from "Record: 11L MLP3.5x LeakyReLU(0.5)^2 + Full SOTA Stack (mean val_bpb=1.1330, 8xH100)" to "Non-record: 11L MLP3.5x LeakyReLU(0.5)^2 + Full SOTA Stack (mean val_bpb=1.1330, 8xH100)" on Mar 24, 2026