
Record: 11L + Efficient Partial XSA (val_bpb: 1.1307) #265

Merged
cocohearts merged 1 commit into openai:main from
unnir:submission/v22-XSA3-beats-top1
Mar 23, 2026

Conversation

@unnir
Contributor

@unnir unnir commented Mar 20, 2026

11L + Efficient Partial XSA (val_bpb: 1.1307)

Results

  • val_bpb: 1.1307 (sliding window, stride=64)
  • Pre-quantization BPB: 1.1437
  • Model parameters: 26,829,913
  • Artifact size: 15,892,986 bytes (under 16MB limit)
  • Training: 6,976 steps in 600 seconds (~86ms/step)
  • SWA: 13-checkpoint average during warmdown (every 120 steps)

Novel Contribution: Efficient Partial Exclusive Self Attention (XSA)

Based on Exclusive Self Attention (arXiv:2603.09078), we introduce two key improvements:

1. Efficient GQA-Aware Implementation

Standard XSA with Grouped Query Attention requires repeat_interleave to expand value vectors
from num_kv_heads to num_heads, doubling memory allocation per layer. Our implementation
uses a free reshape into KV head groups + broadcasting:

# OLD: expensive tensor duplication
v_expanded = v.repeat_interleave(group_size, dim=-2)   # allocates a 2x copy
vn = F.normalize(v_expanded, dim=-1)
y = y - (y * vn).sum(-1, keepdim=True) * vn

# NEW: free reshape + broadcast (no extra allocation)
y_grouped = y.reshape(B, T, Hkv, group_size, D)        # view, no copy
vn = F.normalize(v, dim=-1).unsqueeze(-2)              # [B, T, Hkv, 1, D]
y = (y_grouped - (y_grouped * vn).sum(-1, keepdim=True) * vn).reshape(B, T, H, D)

This reduces XSA overhead from ~7ms/step to ~2ms/step at 11 layers with GQA (8 heads, 4 KV heads).
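
As a sanity check, the two forms can be verified to agree numerically. A minimal runnable sketch (the shapes B, T, H, Hkv, D below are illustrative assumptions, not the PR's actual config):

```python
import torch
import torch.nn.functional as F

B, T, H, Hkv, D = 2, 16, 8, 4, 64
group_size = H // Hkv
y = torch.randn(B, T, H, D)    # attention output, one slot per query head
v = torch.randn(B, T, Hkv, D)  # value vectors, one slot per KV head

# Reference: duplicate each KV head's values across its query-head group.
v_exp = v.repeat_interleave(group_size, dim=-2)  # [B, T, H, D], real copy
vn = F.normalize(v_exp, dim=-1)
ref = y - (y * vn).sum(-1, keepdim=True) * vn

# GQA-aware: view query heads in KV-head groups, let broadcasting expand.
y_g = y.reshape(B, T, Hkv, group_size, D)        # view, no copy
vn2 = F.normalize(v, dim=-1).unsqueeze(-2)       # [B, T, Hkv, 1, D]
out = (y_g - (y_g * vn2).sum(-1, keepdim=True) * vn2).reshape(B, T, H, D)

print(torch.allclose(ref, out, atol=1e-6))  # True
```

The equivalence holds because repeat_interleave on the KV-head axis and the reshape into [Hkv, group_size] groups order query heads the same way.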

2. Partial Application to Deepest Layers Only

The XSA paper shows self-attention bias (cosine similarity between output and self-value)
increases across layers. We apply XSA only to the last 3 layers (out of 11), targeting
the layers with highest self-attention bias while minimizing compute overhead.

Combined, these give ~0.002 BPB improvement over the baseline at <2ms/step cost.
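
The layer-selection logic described above is simple; a minimal sketch (variable and helper names are assumptions, mirroring the XSA_LAST_N=3 run flag):

```python
# Select the deepest XSA_LAST_N layers of an 11-layer stack (0-indexed).
num_layers, xsa_last_n = 11, 3
xsa_layers = set(range(num_layers - xsa_last_n, num_layers))
print(sorted(xsa_layers))  # [8, 9, 10]

def use_xsa(layer_idx: int) -> bool:
    """True for the layers where the XSA projection should run."""
    return layer_idx in xsa_layers
```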

Architecture

  • 11 transformer layers, 512-dim, 8 heads (4 KV heads via GQA)
  • 3x MLP expansion (1536 hidden), relu-squared activation
  • U-Net skip connections (encoder=5, decoder=6)
  • SmearGate + BigramHash (2048 buckets, dim=128)
  • Tied embeddings, logit softcap=30.0
  • NTK-aware RoPE (train_seq_len=1024, auto-scales at 2048)
  • XSA on layers 8, 9, 10 (deepest 3 of 11)
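
The logit softcap bullet corresponds to the standard tanh-based capping (as used in e.g. Gemma 2); a minimal sketch, assuming this PR follows that formulation:

```python
import torch

def softcap(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    # Smoothly bound logits to (-cap, cap) while keeping gradients nonzero.
    return cap * torch.tanh(logits / cap)

x = torch.tensor([0.0, 15.0, 300.0])
print(softcap(x))  # tanh saturates: roughly [0.0, 13.9, 30.0]
```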

Training

  • FlashAttention 3 (Hopper-optimized)
  • Muon optimizer: lr=0.025, momentum=0.99 (warmup from 0.92 over 1500 steps)
  • AdamW for embeddings/scalars: lr=0.035/0.025
  • Weight decay: 0.04 (both Muon and AdamW)
  • Warmdown: 3000 iterations, grad clip 0.3
  • SWA every 120 steps (scale < 0.5), 13-checkpoint uniform average
  • OrthoInit + muP-scaled output projections
  • Seed: 1337
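
The SWA bullet is a plain uniform average of saved checkpoints; a minimal sketch (state-dict layout assumed, not the PR's actual training loop):

```python
import torch

def swa_average(state_dicts):
    """Uniform (equal-weight) average of a list of model state dicts."""
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }

# 13 toy "checkpoints" whose single weight is 0, 1, ..., 12.
ckpts = [{"w": torch.full((2, 2), float(i))} for i in range(13)]
avg = swa_average(ckpts)
print(avg["w"][0, 0].item())  # 6.0  (mean of 0..12)
```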

Quantization

  • Int6 per-row quantization on MLP + attention weights
  • Int8 for embeddings
  • zstd level 22 compression
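
A minimal sketch of symmetric per-row int6 quantization (the PR's exact scheme, zero-point handling, and bit-packing are not shown on this page, so the details here are assumptions):

```python
import torch

def quantize_int6_per_row(w: torch.Tensor):
    """Symmetric per-row quantization to the int6 range [-31, 31]."""
    qmax = 31
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.round(w / scale).clamp(-qmax, qmax).to(torch.int8)  # 6-bit values
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4, 16)
q, scale = quantize_int6_per_row(w)
err = (dequantize(q, scale) - w).abs().max()
print(bool(err <= scale.max() / 2 + 1e-6))  # True: rounding error <= scale/2
```

Per-row scales keep the rounding error proportional to each row's own magnitude, which is why per-row beats per-tensor quantization for weight matrices with uneven row norms.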

Run Command

NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=2048 \
MUON_WD=0.04 ADAM_WD=0.04 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3000 \
ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \
SWA_EVERY=120 SWA_ENABLED=1 MTP_NUM_HEADS=0 SEED=1337 \
WARMUP_STEPS=30 VAL_LOSS_EVERY=2000 XSA_LAST_N=3 \
torchrun --nproc_per_node=8 train_gpt.py

Summary

  Novel: Efficient Partial Exclusive Self Attention on last 3 layers.
  GQA-aware reshape avoids tensor duplication (<2ms overhead).
  Beats prior SOTA (1.1318) by 0.0011 BPB. 15.9MB artifact.
@mohosy
Copy link
Copy Markdown

mohosy commented Mar 20, 2026

the gqa aware xsa reshape is really clean, way better than the repeat_interleave approach. curious if you tried applying it to more than 3 layers or if the gains plateau after that

newjordan referenced this pull request in newjordan/parameter-golf Mar 21, 2026
- XSA: subtract self-attention projection in last N layers (arXiv:2603.09078)
  Zero params, GQA-aware implementation. PR #265 shows +0.002 BPB.
- Corrected v2 config based on leaderboard analysis:
  * 11L not 10L (consensus at top)
  * Int6 not int5 (int5 penalty outweighs savings at 11L per #236)
  * WD 0.04 (consensus — SWA makes it work)
  * XSA last 3 layers
  * Seq ramp 256→1024 (novel)
  * zstd-22 + 3% pruning

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HyperPotatoNeo added a commit to HyperPotatoNeo/parameter-golf that referenced this pull request Mar 21, 2026
Stacks XSA (PR openai#265), EMA weight averaging (PR openai#287), Int5-MLP (PR openai#264),
MuonWD=0.04 tuned from PR openai#162, seq_len=2048, 11 layers, BigramHash(2048),
SmearGate, OrthoInit (PR openai#135), Late-K FP16 on final layer.
Single-seed result (seed=1337), ~8903 steps on 8xH100.
rarce added a commit to rarce/parameter-golf that referenced this pull request Mar 21, 2026
Three techniques from the top PRs (openai#265, openai#287, openai#297):

1. XSA (Exclusive Self Attention) on last 3 layers (XSA_LAST_N=3):
   Removes self-value bias via orthogonal projection (arXiv:2603.09078).
   GQA-aware: uses reshape+broadcast instead of repeat_interleave.
   Zero new parameters, ~2ms/step overhead.

2. EMA (decay=0.997) replaces SWA (EMA_ENABLED=1, SWA_ENABLED=0):
   Exponential moving average updated every step during warmdown.
   Smoother weight averaging, better generalization/compression.

3. Late QAT (QAT_LATE_FRAC=0.85):
   QAT activates at 85% of wallclock to avoid Muon momentum corruption.
   LR halved when QAT activates (per PR openai#297 finding).

Trimmed comments to stay under 1500-line cap (1457 lines).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cocohearts
Copy link
Copy Markdown
Collaborator

ty for keeping pr's and diffs clean, much appreciated

newjordan referenced this pull request in newjordan/parameter-golf Mar 21, 2026
#1 untried combination from competition commentary:
TTT (from #254) + XSA (from #265) = estimated 1.117-1.121 BPB
XSA_LAST_N=3 excludes self-attention in final 3 layers.
Zero extra params, frees attention capacity for cross-token focus.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cocohearts cocohearts merged commit 56a9283 into openai:main Mar 23, 2026
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 23, 2026
openai#1 untried combination from competition commentary:
TTT (from openai#254) + XSA (from openai#265) = estimated 1.117-1.121 BPB
XSA_LAST_N=3 excludes self-attention in final 3 layers.
Zero extra params, frees attention capacity for cross-token focus.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
nvemuri4649 pushed a commit to thanushpatlolla/parameter-golf that referenced this pull request Mar 27, 2026
Record: 11L + Efficient Partial XSA (val_bpb: 1.1307)
