Record: 11L + Efficient Partial XSA (val_bpb: 1.1307) #265
Merged
cocohearts merged 1 commit into openai:main on Mar 23, 2026
Conversation
Novel: Efficient Partial Exclusive Self Attention on last 3 layers. GQA-aware reshape avoids tensor duplication (<2ms overhead). Beats prior SOTA (1.1318) by 0.0011 BPB. 15.9MB artifact.
the gqa aware xsa reshape is really clean, way better than the repeat_interleave approach. curious if you tried applying it to more than 3 layers or if the gains plateau after that
newjordan referenced this pull request in newjordan/parameter-golf on Mar 21, 2026
- XSA: subtract self-attention projection in last N layers (arXiv:2603.09078). Zero params, GQA-aware implementation. PR #265 shows +0.002 BPB.
- Corrected v2 config based on leaderboard analysis:
  * 11L not 10L (consensus at top)
  * Int6 not int5 (int5 penalty outweighs savings at 11L per #236)
  * WD 0.04 (consensus; SWA makes it work)
  * XSA last 3 layers
  * Seq ramp 256→1024 (novel)
  * zstd-22 + 3% pruning
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HyperPotatoNeo added a commit to HyperPotatoNeo/parameter-golf that referenced this pull request on Mar 21, 2026
Stacks XSA (PR openai#265), EMA weight averaging (PR openai#287), Int5-MLP (PR openai#264), MuonWD=0.04 tuned from PR openai#162, seq_len=2048, 11 layers, BigramHash(2048), SmearGate, OrthoInit (PR openai#135), Late-K FP16 on final layer. Single-seed result (seed=1337), ~8903 steps on 8xH100.
rarce added a commit to rarce/parameter-golf that referenced this pull request on Mar 21, 2026
Three techniques from the top PRs (openai#265, openai#287, openai#297):
1. XSA (Exclusive Self Attention) on last 3 layers (XSA_LAST_N=3): removes self-value bias via orthogonal projection (arXiv:2603.09078). GQA-aware: uses reshape+broadcast instead of repeat_interleave. Zero new parameters, ~2ms/step overhead.
2. EMA (decay=0.997) replaces SWA (EMA_ENABLED=1, SWA_ENABLED=0): exponential moving average updated every step during warmdown. Smoother weight averaging, better generalization/compression.
3. Late QAT (QAT_LATE_FRAC=0.85): QAT activates at 85% of wallclock to avoid Muon momentum corruption. LR halved when QAT activates (per PR openai#297 finding).
Trimmed comments to stay under 1500-line cap (1457 lines).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
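The EMA replacement for SWA described in point 2 amounts to one in-place blend per optimizer step. A minimal sketch, assuming parallel lists of parameter tensors (the helper name `ema_update` and its signature are illustrative, not the PR's actual code):

```python
import torch

@torch.no_grad()
def ema_update(ema_params, params, decay=0.997):
    """One EMA step: ema <- decay * ema + (1 - decay) * current.

    `ema_params` and `params` are parallel iterables of tensors;
    called once per optimizer step during the warmdown phase.
    """
    for e, p in zip(ema_params, params):
        e.mul_(decay).add_(p, alpha=1.0 - decay)
```

With decay=0.997 the averaging window is roughly 1/(1-0.997) ≈ 333 steps, which is why it is only enabled during warmdown.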
Collaborator
ty for keeping PRs and diffs clean, much appreciated
newjordan referenced this pull request in newjordan/parameter-golf on Mar 21, 2026
#1 untried combination from competition commentary: TTT (from #254) + XSA (from #265) = estimated 1.117-1.121 BPB XSA_LAST_N=3 excludes self-attention in final 3 layers. Zero extra params, frees attention capacity for cross-token focus. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This was referenced Mar 23, 2026
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request on Mar 23, 2026
openai#1 untried combination from competition commentary: TTT (from openai#254) + XSA (from openai#265) = estimated 1.117-1.121 BPB XSA_LAST_N=3 excludes self-attention in final 3 layers. Zero extra params, frees attention capacity for cross-token focus. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This was referenced Mar 24, 2026
This was referenced Mar 25, 2026
nvemuri4649 pushed a commit to thanushpatlolla/parameter-golf that referenced this pull request on Mar 27, 2026
Record: 11L + Efficient Partial XSA (val_bpb: 1.1307)
11L + Efficient Partial XSA (val_bpb: 1.1307)
Results
Novel Contribution: Efficient Partial Exclusive Self Attention (XSA)
Based on Exclusive Self Attention (arXiv:2603.09078), we introduce two key improvements:
1. Efficient GQA-Aware Implementation
Standard XSA with Grouped Query Attention requires repeat_interleave to expand value vectors from num_kv_heads to num_heads, doubling memory allocation per layer. Our implementation uses a free reshape into KV head groups plus broadcasting:
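A minimal sketch of the reshape-plus-broadcast trick, assuming the XSA correction is "subtract each query's self-attention weight times its own value" (function and argument names here are illustrative, not the PR's actual code):

```python
import torch

def xsa_subtract_self(attn_out, v, self_weight, num_kv_heads):
    """Subtract each token's own value from the attention output (XSA),
    GQA-aware: v stays at num_kv_heads, never expanded to num_heads.

    attn_out:    [B, num_heads, T, head_dim]  standard attention output
    v:           [B, num_kv_heads, T, head_dim]  unexpanded GQA values
    self_weight: [B, num_heads, T, 1]  attention prob each query puts on itself
    """
    B, H, T, D = attn_out.shape
    G = H // num_kv_heads  # query heads per KV head
    # Free reshape: view query heads as (kv_head, group) pairs.
    out = attn_out.view(B, num_kv_heads, G, T, D)
    w = self_weight.view(B, num_kv_heads, G, T, 1)
    # Broadcast v across the group axis instead of repeat_interleave,
    # so no duplicated value tensor is ever materialized.
    out = out - w * v.unsqueeze(2)  # v: [B, num_kv_heads, 1, T, D]
    return out.view(B, H, T, D)
```

The result is numerically identical to the repeat_interleave version; the only difference is that the value tensor is broadcast lazily rather than copied.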
This reduces XSA overhead from ~7ms/step to ~2ms/step at 11 layers with GQA (8 heads, 4 KV heads).
2. Partial Application to Deepest Layers Only
The XSA paper shows self-attention bias (cosine similarity between output and self-value)
increases across layers. We apply XSA only to the last 3 layers (out of 11), targeting
the layers with highest self-attention bias while minimizing compute overhead.
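The layer gating itself is a one-line predicate. A sketch under the thread's assumed config name XSA_LAST_N (the exact variable in the training script may differ):

```python
NUM_LAYERS = 11
XSA_LAST_N = 3  # assumed config name from the thread

def use_xsa(layer_idx, num_layers=NUM_LAYERS, last_n=XSA_LAST_N):
    """Enable XSA only on the deepest `last_n` layers (layer_idx is 0-indexed),
    where the paper's measured self-attention bias is highest."""
    return layer_idx >= num_layers - last_n
```

For the 11-layer model this enables XSA on layers 8, 9, and 10 only.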
Combined, these give ~0.002 BPB improvement over the baseline at <2ms/step cost.
Architecture
Training
Quantization
Run Command
References