
Record: 11L EMA + Int6 + XSA + LeakyReLU² + Partial RoPE (val_bpb: 1.1309) #493

Open

parinzee wants to merge 1 commit into openai:main from parinzee:submission/2026-03-23-11L-EMA-Int6

Conversation

@parinzee

Summary

  • val_bpb: 1.1309 (mean of 3 seeds, std: 0.00017)
  • 11 layers, 512 dim, 8H/4KV GQA
  • Artifact: ~15.8 MB (all seeds under 16 MB)

3-Seed Results

| Seed | val_bpb | artifact_bytes |
|------|---------|----------------|
| 42   | 1.13109 | 15,764,564     |
| 1337 | 1.13085 | 15,626,741     |
| 2024 | 1.13067 | 15,923,256     |
| Mean | 1.13087 |                |
| Std  | 0.00017 |                |

Key Changes from Baseline

  1. 11 layers (up from 10), 512 dim, 8 heads / 4 KV heads (GQA)
  2. XSA (Exclusive Self Attention) on last 4 layers for better representation
  3. LeakyReLU(0.5)² activation — squared leaky ReLU with 0.5 negative slope
  4. Partial RoPE — only 16/64 dims use rotary embeddings
  5. EMA weight averaging (decay=0.997) for smoother final weights
  6. Int6 quantization for all large weight matrices + zstd-22 compression
  7. Scale clamping fix — clamp_min(1/clip_range) improves quantization quality
  8. Smaller batch size (524288 tokens) to fit more training steps (~8200 steps in 600s)
  9. BigramHash(2048, dim=128) token embeddings
  10. warmdown_iters=4500 for learning rate schedule
  11. Higher learning rates (matrix_lr=0.025, scalar_lr=0.025)
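The activation change (item 3) replaces the baseline's squared ReLU with a squared leaky ReLU: the square keeps the output non-negative while the 0.5 slope keeps gradients flowing for negative pre-activations. A minimal sketch (NumPy here for illustration; the PR itself uses PyTorch):

```python
import numpy as np

def leaky_relu_sq(x: np.ndarray, slope: float = 0.5) -> np.ndarray:
    """Squared leaky ReLU: (leaky_relu(x, slope))**2.

    For x >= 0 this is x**2 (same as the squared-ReLU baseline);
    for x < 0 it is (slope * x)**2, so the output stays non-negative
    but the gradient (2 * slope**2 * x) is nonzero on the negative side.
    """
    y = np.where(x >= 0, x, slope * x)
    return y * y
```

For example, `leaky_relu_sq` maps 2.0 to 4.0 and -2.0 to 1.0, whereas plain squared ReLU would map -2.0 to 0.0 with a dead gradient.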

Run Command

torchrun --standalone --nproc_per_node=8 train_gpt.py
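The EMA averaging (change 5, decay=0.997) is one update per training step over a shadow copy of the weights; the shadow copy is what gets quantized and shipped. A sketch, assuming the weights live in a dict of arrays (the real training state layout is not shown in this PR):

```python
import numpy as np

def ema_update(ema: dict, params: dict, decay: float = 0.997) -> dict:
    """One EMA step: ema <- decay * ema + (1 - decay) * params.

    decay=0.997 matches the PR; the dict-of-arrays layout is an
    assumption for illustration.
    """
    for k in params:
        ema[k] = decay * ema[k] + (1.0 - decay) * params[k]
    return ema
```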

Built on SOTA baseline by @thwu1 (PR #180).
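Changes 6 and 7 (int6 quantization plus the scale-clamp fix) can be sketched as symmetric round-to-nearest quantization with the per-tensor scale floored so it cannot collapse. Where exactly the `clamp_min(1/clip_range)` sits, and the `clip_range` value, are assumptions here; the PR only names the fix. The int6 codes would then be bit-packed and compressed with zstd level 22 to form the artifact (compression step omitted below):

```python
import numpy as np

def quantize_int6(w: np.ndarray, clip_range: float = 4.0):
    """Symmetric int6 quantization: codes in [-31, 31].

    The scale floor mirrors the PR's clamp_min(1 / clip_range) fix
    (placement and clip_range=4.0 are assumptions, not from the PR).
    """
    scale = float(np.abs(w).max()) / 31.0
    scale = max(scale, 1.0 / clip_range)  # scale-clamp fix
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale
```

Round-to-nearest guarantees the per-element reconstruction error is at most half a quantization step (`scale / 2`) whenever no clipping occurs.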

…1309)

3-seed validation results:
- Seed 42:   val_bpb=1.13109, artifact=15,764,564 bytes
- Seed 1337: val_bpb=1.13085, artifact=15,626,741 bytes
- Seed 2024: val_bpb=1.13067, artifact=15,923,256 bytes
- Mean: 1.13087 (std: 0.00017)

Key techniques: 11 layers, GQA (8H/4KV), XSA on last 4 layers,
LeakyReLU(0.5)², Partial RoPE (16/64), EMA (0.997), int6 quantization,
zstd-22 compression, BigramHash(2048,128), warmdown_iters=4500.

Built on baseline by @thwu1 (PR openai#180).
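The partial RoPE listed above rotates only 16 of the 64 head dimensions and passes the rest through unchanged. A sketch for a single head; the first-half/second-half pairing convention and the `base=10000.0` frequency base are assumptions (the PR does not spell them out):

```python
import numpy as np

def partial_rope(x: np.ndarray, rot_dims: int = 16, base: float = 10000.0) -> np.ndarray:
    """Apply rotary embeddings to the first `rot_dims` of each head dim.

    x: (seq, head_dim). Dims [rot_dims:] are passed through untouched,
    which is the 'partial' in partial RoPE (16/64 in this PR).
    """
    seq, _ = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(seq)[:, None] * inv_freq[None, :]   # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rot = np.concatenate([x1 * cos - x2 * sin,
                          x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, x[:, rot_dims:]], axis=-1)
```

Because each (x1, x2) pair undergoes a pure rotation, the norm of the rotated slice is preserved, and position 0 is the identity.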
sofiabod added a commit to sofiabod/parameter-golf that referenced this pull request Mar 23, 2026
- replace relu(x)^2 with leaky_relu(x, 0.5)^2
- PR openai#493 reaches 1.1309 with partial stack using this activation
- untried on full openai#414 stack — could give -0.002 to -0.005 BPB
- zero param cost, zero speed overhead
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 23, 2026
Key changes from studying PR openai#505 (1.1181) and openai#486 (1.0887):
- train_batch_tokens: 524K → 786K (all top entries use this)
- bigram_hash_buckets: 4096 → 8192 (PR openai#505 uses 8192, openai#493 uses 10240)
- grad_clip_norm: 0.3 → 0.0 (PR openai#505 disables clipping)
- Star-ReLU and TrigramHash enabled in all run scripts
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 23, 2026
RoyiRa added a commit to RoyiRa/parameter-golf that referenced this pull request Mar 23, 2026
…bpb 1.1178

3-seed mean: 1.1178 BPB (std 0.0005), ~15.75 MB artifact, 8×H100 SXM.

Novel contribution: Late Soft-Round QAT — replaces STE identity surrogate
with sigmoid soft-round in the backward pass during the final 2% of training,
giving bin-aware gradients that settle weights onto int6 grid points.

Built on PR openai#414 (base model), PR openai#461 (TTT recipe), PR openai#493 (LeakyReLU²).
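The late soft-round QAT described above keeps a hard round in the forward pass but replaces the STE's identity backward with the derivative of a smooth staircase. One plausible parameterization is sketched below (tanh is the symmetric form of a sigmoid; the exact surrogate, sharpness `alpha`, and scheduling are assumptions, not taken from the commit):

```python
import numpy as np

def qat_round(x: np.ndarray, alpha: float = 5.0):
    """Hard round forward, soft-round slope for the backward pass.

    Soft round: s(x) = m + tanh(alpha*(x - m)) / (2*tanh(alpha/2)),
    with m = floor(x) + 0.5. Its derivative is near zero at integer
    grid points and peaks at bin boundaries, so weights near a grid
    point stop moving ('settle') while boundary weights get pushed.
    This exact form is an assumption; the PR only says 'sigmoid
    soft-round'.
    """
    y = np.round(x)                       # forward: hard quantization
    r = x - np.floor(x) - 0.5             # offset from bin center of s(x)
    grad = alpha * (1.0 - np.tanh(alpha * r) ** 2) / (2.0 * np.tanh(alpha / 2.0))
    return y, grad                        # grad would replace STE's 1.0
```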
Fraser-Greenlee pushed a commit to Fraser-Greenlee/parameter-golf that referenced this pull request Mar 25, 2026
- Interleaved draft tokens: soft predictions placed between real tokens
  for 1-2 token lookahead via standard causal attention
- SmearGate and BigramHash naturally gain future context on interleaved seq
- Bigram noise curriculum: drafts anneal from GT to realistic noise
- Two-pass eval: pass 1 generates drafts, pass 2 refines with interleaving
- LeakyReLU(0.5)² activation toggle (free -0.003 BPB from PR openai#493)
- W&B logging (opt-in via WANDB_PROJECT env var)
- Sweep runner with 13 configs covering baselines, draft variants, and ablations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
srchandrupatla added a commit to srchandrupatla/parameter-golf that referenced this pull request Mar 25, 2026
LeakyReLU(0.5)²: preserves negative gradient flow through MLP while
maintaining non-negative output. ~0.003 BPB improvement per PR openai#493.

Legal TTT (test-time training): at eval time, split val tokens into
32K-token chunks, score each chunk under inference_mode(), then train
on the already-scored chunk with SGD. Gives ~0.0025 BPB improvement
per PR openai#461. Score-first protocol guarantees no future information
leaks into scored tokens.
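The score-first TTT protocol above can be sketched as a loop that scores each chunk under frozen weights before taking any optimization step on it. The `ToyModel` below (a unigram counter standing in for the transformer) and its `score_bits`/`sgd_step` API are invented for illustration; only the chunk size and the score-then-train ordering come from the description:

```python
import numpy as np

class ToyModel:
    """Hypothetical stand-in for the real model: a unigram token model."""
    def __init__(self, vocab: int = 256):
        self.counts = np.ones(vocab)           # Laplace-smoothed counts
    def score_bits(self, chunk: np.ndarray) -> float:
        p = self.counts / self.counts.sum()    # frozen weights for scoring
        return float(-np.log2(p[chunk]).sum())
    def sgd_step(self, chunk: np.ndarray) -> None:
        np.add.at(self.counts, chunk, 1.0)     # 'train' on the scored chunk

def ttt_eval(model, tokens: np.ndarray, chunk_tokens: int = 32768) -> float:
    """Legal TTT: score each chunk, THEN adapt on it.

    No scored token ever sees an update derived from itself or from
    any future token, which is the leak-free guarantee. chunk_tokens
    matches the 32K in the description.
    """
    bits = 0.0
    for i in range(0, len(tokens), chunk_tokens):
        c = tokens[i:i + chunk_tokens]
        bits += model.score_bits(c)   # score first, under frozen weights
        model.sgd_step(c)             # adapt only after the chunk is scored
    return bits / len(tokens)         # bits per token
```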
Mistobaan pushed a commit to Mistobaan/parameter-golf that referenced this pull request Mar 25, 2026
nedcut pushed a commit to nedcut/parameter-golf that referenced this pull request Mar 26, 2026
nvemuri4649 pushed a commit to thanushpatlolla/parameter-golf that referenced this pull request Mar 27, 2026
anish-krishnan pushed a commit to anish-krishnan/parameter-golf that referenced this pull request Mar 30, 2026
Itssshikhar pushed a commit to Itssshikhar/parameter-golf that referenced this pull request Mar 31, 2026