
Record: 11L + Efficient Partial XSA (val_bpb: 1.1307) #265

Merged
cocohearts merged 1 commit into openai:main from
unnir:submission/v22-XSA3-beats-top1
Mar 23, 2026

Conversation

@unnir
Contributor

@unnir unnir commented Mar 20, 2026

11L + Efficient Partial XSA (val_bpb: 1.1307)

Results

  • val_bpb: 1.1307 (sliding window, stride=64)
  • Pre-quantization BPB: 1.1437
  • Model parameters: 26,829,913
  • Artifact size: 15,892,986 bytes (under 16MB limit)
  • Training: 6,976 steps in 600 seconds (~86ms/step)
  • SWA: 13-checkpoint average during warmdown (every 120 steps)

Novel Contribution: Efficient Partial Exclusive Self Attention (XSA)

Based on Exclusive Self Attention (arXiv:2603.09078), we introduce two key improvements:

1. Efficient GQA-Aware Implementation

Standard XSA with Grouped Query Attention requires repeat_interleave to expand value vectors
from num_kv_heads to num_heads, doubling memory allocation per layer. Our implementation
uses a free reshape into KV head groups + broadcasting:

# OLD: expensive tensor duplication
v_expanded = v.repeat_interleave(group_size, dim=-2)   # allocates a 2x copy
vn = F.normalize(v_expanded, dim=-1)
y = y - (y * vn).sum(-1, keepdim=True) * vn

# NEW: free reshape + broadcast (no extra allocation)
y_grouped = y.reshape(B, T, Hkv, group_size, D)        # view, no copy
vn = F.normalize(v, dim=-1).unsqueeze(-2)              # [B, T, Hkv, 1, D]
y = (y_grouped - (y_grouped * vn).sum(-1, keepdim=True) * vn).reshape(B, T, H, D)

This reduces XSA overhead from ~7ms/step to ~2ms/step at 11 layers with GQA (8 heads, 4 KV heads).
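
As a sanity check, the two forms can be verified to agree numerically. A minimal runnable sketch (the shapes B, T, H, Hkv, D below are illustrative assumptions, not the PR's actual config):

```python
import torch
import torch.nn.functional as F

B, T, H, Hkv, D = 2, 16, 8, 4, 64
group_size = H // Hkv
y = torch.randn(B, T, H, D)    # attention output, one slot per query head
v = torch.randn(B, T, Hkv, D)  # value vectors, one slot per KV head

# Reference: duplicate each KV head's values across its query-head group.
v_exp = v.repeat_interleave(group_size, dim=-2)  # [B, T, H, D], real copy
vn = F.normalize(v_exp, dim=-1)
ref = y - (y * vn).sum(-1, keepdim=True) * vn

# GQA-aware: view query heads in KV-head groups, let broadcasting expand.
y_g = y.reshape(B, T, Hkv, group_size, D)        # view, no copy
vn2 = F.normalize(v, dim=-1).unsqueeze(-2)       # [B, T, Hkv, 1, D]
out = (y_g - (y_g * vn2).sum(-1, keepdim=True) * vn2).reshape(B, T, H, D)

print(torch.allclose(ref, out, atol=1e-6))  # True
```

The equivalence holds because repeat_interleave on the KV-head axis and the reshape into [Hkv, group_size] groups order query heads the same way.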

2. Partial Application to Deepest Layers Only

The XSA paper shows self-attention bias (cosine similarity between output and self-value)
increases across layers. We apply XSA only to the last 3 layers (out of 11), targeting
the layers with highest self-attention bias while minimizing compute overhead.

Combined, these give ~0.002 BPB improvement over the baseline at <2ms/step cost.
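
The layer-selection logic described above is simple; a minimal sketch (variable and helper names are assumptions, mirroring the XSA_LAST_N=3 run flag):

```python
# Select the deepest XSA_LAST_N layers of an 11-layer stack (0-indexed).
num_layers, xsa_last_n = 11, 3
xsa_layers = set(range(num_layers - xsa_last_n, num_layers))
print(sorted(xsa_layers))  # [8, 9, 10]

def use_xsa(layer_idx: int) -> bool:
    """True for the layers where the XSA projection should run."""
    return layer_idx in xsa_layers
```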

Architecture

  • 11 transformer layers, 512-dim, 8 heads (4 KV heads via GQA)
  • 3x MLP expansion (1536 hidden), relu-squared activation
  • U-Net skip connections (encoder=5, decoder=6)
  • SmearGate + BigramHash (2048 buckets, dim=128)
  • Tied embeddings, logit softcap=30.0
  • NTK-aware RoPE (train_seq_len=1024, auto-scales at 2048)
  • XSA on layers 8, 9, 10 (deepest 3 of 11)
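
The logit softcap bullet corresponds to the standard tanh-based capping (as used in e.g. Gemma 2); a minimal sketch, assuming this PR follows that formulation:

```python
import torch

def softcap(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    # Smoothly bound logits to (-cap, cap) while keeping gradients nonzero.
    return cap * torch.tanh(logits / cap)

x = torch.tensor([0.0, 15.0, 300.0])
print(softcap(x))  # tanh saturates: roughly [0.0, 13.9, 30.0]
```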

Training

  • FlashAttention 3 (Hopper-optimized)
  • Muon optimizer: lr=0.025, momentum=0.99 (warmup from 0.92 over 1500 steps)
  • AdamW for embeddings/scalars: lr=0.035/0.025
  • Weight decay: 0.04 (both Muon and AdamW)
  • Warmdown: 3000 iterations, grad clip 0.3
  • SWA every 120 steps (scale < 0.5), 13-checkpoint uniform average
  • OrthoInit + muP-scaled output projections
  • Seed: 1337
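
The SWA bullet is a plain uniform average of saved checkpoints; a minimal sketch (state-dict layout assumed, not the PR's actual training loop):

```python
import torch

def swa_average(state_dicts):
    """Uniform (equal-weight) average of a list of model state dicts."""
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }

# 13 toy "checkpoints" whose single weight is 0, 1, ..., 12.
ckpts = [{"w": torch.full((2, 2), float(i))} for i in range(13)]
avg = swa_average(ckpts)
print(avg["w"][0, 0].item())  # 6.0  (mean of 0..12)
```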

Quantization

  • Int6 per-row quantization on MLP + attention weights
  • Int8 for embeddings
  • zstd level 22 compression
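
A minimal sketch of symmetric per-row int6 quantization (the PR's exact scheme, zero-point handling, and bit-packing are not shown on this page, so the details here are assumptions):

```python
import torch

def quantize_int6_per_row(w: torch.Tensor):
    """Symmetric per-row quantization to the int6 range [-31, 31]."""
    qmax = 31
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.round(w / scale).clamp(-qmax, qmax).to(torch.int8)  # 6-bit values
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4, 16)
q, scale = quantize_int6_per_row(w)
err = (dequantize(q, scale) - w).abs().max()
print(bool(err <= scale.max() / 2 + 1e-6))  # True: rounding error <= scale/2
```

Per-row scales keep the rounding error proportional to each row's own magnitude, which is why per-row beats per-tensor quantization for weight matrices with uneven row norms.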

Run Command

NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=2048 \
MUON_WD=0.04 ADAM_WD=0.04 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3000 \
ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \
SWA_EVERY=120 SWA_ENABLED=1 MTP_NUM_HEADS=0 SEED=1337 \
WARMUP_STEPS=30 VAL_LOSS_EVERY=2000 XSA_LAST_N=3 \
torchrun --nproc_per_node=8 train_gpt.py

Summary

  Novel: Efficient Partial Exclusive Self Attention on last 3 layers.
  GQA-aware reshape avoids tensor duplication (<2ms overhead).
  Beats prior SOTA (1.1318) by 0.0011 BPB. 15.9MB artifact.
@mohosy
Copy link
Copy Markdown

mohosy commented Mar 20, 2026

the gqa aware xsa reshape is really clean, way better than the repeat_interleave approach. curious if you tried applying it to more than 3 layers or if the gains plateau after that

newjordan referenced this pull request in newjordan/parameter-golf Mar 21, 2026
- XSA: subtract self-attention projection in last N layers (arXiv:2603.09078)
  Zero params, GQA-aware implementation. PR #265 shows +0.002 BPB.
- Corrected v2 config based on leaderboard analysis:
  * 11L not 10L (consensus at top)
  * Int6 not int5 (int5 penalty outweighs savings at 11L per #236)
  * WD 0.04 (consensus — SWA makes it work)
  * XSA last 3 layers
  * Seq ramp 256→1024 (novel)
  * zstd-22 + 3% pruning

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HyperPotatoNeo added a commit to HyperPotatoNeo/parameter-golf that referenced this pull request Mar 21, 2026
Stacks XSA (PR openai#265), EMA weight averaging (PR openai#287), Int5-MLP (PR openai#264),
MuonWD=0.04 tuned from PR openai#162, seq_len=2048, 11 layers, BigramHash(2048),
SmearGate, OrthoInit (PR openai#135), Late-K FP16 on final layer.
Single-seed result (seed=1337), ~8903 steps on 8xH100.
rarce added a commit to rarce/parameter-golf that referenced this pull request Mar 21, 2026
Three techniques from the top PRs (openai#265, openai#287, openai#297):

1. XSA (Exclusive Self Attention) on last 3 layers (XSA_LAST_N=3):
   Removes self-value bias via orthogonal projection (arXiv:2603.09078).
   GQA-aware: uses reshape+broadcast instead of repeat_interleave.
   Zero new parameters, ~2ms/step overhead.

2. EMA (decay=0.997) replaces SWA (EMA_ENABLED=1, SWA_ENABLED=0):
   Exponential moving average updated every step during warmdown.
   Smoother weight averaging, better generalization/compression.

3. Late QAT (QAT_LATE_FRAC=0.85):
   QAT activates at 85% of wallclock to avoid Muon momentum corruption.
   LR halved when QAT activates (per PR openai#297 finding).

Trimmed comments to stay under 1500-line cap (1457 lines).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cocohearts
Copy link
Copy Markdown
Collaborator

ty for keeping pr's and diffs clean, much appreciated

newjordan referenced this pull request in newjordan/parameter-golf Mar 21, 2026
#1 untried combination from competition commentary:
TTT (from #254) + XSA (from #265) = estimated 1.117-1.121 BPB
XSA_LAST_N=3 excludes self-attention in final 3 layers.
Zero extra params, frees attention capacity for cross-token focus.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cocohearts cocohearts merged commit 56a9283 into openai:main Mar 23, 2026
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 23, 2026
openai#1 untried combination from competition commentary:
TTT (from openai#254) + XSA (from openai#265) = estimated 1.117-1.121 BPB
XSA_LAST_N=3 excludes self-attention in final 3 layers.
Zero extra params, frees attention capacity for cross-token focus.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
nvemuri4649 pushed a commit to thanushpatlolla/parameter-golf that referenced this pull request Mar 27, 2026
Record: 11L + Efficient Partial XSA (val_bpb: 1.1307)
