
11L + XSA + VRL + SWA + seq4096 + cross-doc TTT - val_bpb 1.1839 #457

Open
carlesonielfa wants to merge 1 commit into openai:main from carlesonielfa:submission/2026-03-22_11L_XSA_VRL_SWA_seq4096

Conversation

@carlesonielfa

Stacks several wins on the 11L dim=512 base:

  • seq_len=4096: long-context training (single largest contributor)
  • Exclusive Self-Attention (XSA): removes value-aligned component from attention output on deepest 4 layers
  • Value Residual Learning (VRL): per-layer learnable residual from layer-0 value vectors
  • SmearGate: learned token-blending gate at embedding layer
  • SWA: 24 checkpoints averaged from last 40% of warmdown
  • Cross-doc TTT: rank-8 LoRA adapters trained per document at eval time
  • Warmdown-QAT: quantization-aware training during warmdown, for a near-zero post-quantization penalty
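
Of the items above, Value Residual Learning is simple to sketch: each layer's attention values are blended with the value vectors computed at layer 0 through a learnable per-layer gate. A minimal numpy version; the scalar-sigmoid gate parameterization is an assumption, since the PR does not show its exact form:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def value_residual(v_layer, v_layer0, lam):
    """Blend this layer's value vectors with the layer-0 values.

    v_layer, v_layer0: (seq, d_head) arrays; lam: learnable per-layer scalar.
    lam -> -inf recovers the plain values; lam -> +inf copies layer 0.
    """
    g = sigmoid(lam)
    return (1.0 - g) * v_layer + g * v_layer0
```

At lam = 0 the blend is an even 50/50 mix, so a fresh layer starts halfway between its own values and the layer-0 shortcut.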

Results (seed=1337, 8xH100, 600s):

  • post-quant (int8+zlib): 1.2192
  • post-quant + TTT: 1.1839
  • model size: 15.35 MB

…=1.1839

11 layers, seq_len=4096, Exclusive Self-Attention (deepest 4 layers),
Value Residual Learning, SmearGate, SWA (24 ckpts), cross-doc TTT.
Post-quant: 1.2192. With TTT: 1.1839. Model size: 15.35 MB.
13137 steps on 8xH100 in 600s.
ThomAub pushed a commit to ThomAub/parameter-golf that referenced this pull request Mar 22, 2026
…, and PR openai#457 analysis

Comprehensive analysis of 4 TTC techniques for Parameter Golf:
- Sliding window eval (stride<seq_len for better context)
- Depth recurrence (shared layers, more loops at eval)
- Longer context eval with NTK RoPE scaling
- Checkpoint/depth ensemble strategies
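
The sliding-window idea from the first bullet can be planned as below; the function name and return shape are illustrative, not from the commit. With stride < seq_len, each window overlaps the previous one, so every scored token carries up to seq_len - stride tokens of left context while still being scored exactly once:

```python
def plan_windows(n_tokens, seq_len, stride):
    """Plan sliding-window eval so every token is scored exactly once.

    Each window covers up to seq_len tokens; only tokens not yet scored
    (the trailing `stride` tokens, after the first window) are counted.
    Returns (window_start, score_start, window_end) triples.
    """
    plans = []
    scored_to = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + seq_len, n_tokens)
        plans.append((begin, scored_to, end))
        scored_to = end
        if end == n_tokens:
            break
    return plans
```

For example, plan_windows(10, 4, 2) scores tokens 0-3 in the first window, then two new tokens per subsequent window, each seen with two tokens of prior context.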

Includes detailed analysis of PR openai#457's techniques (XSA, VRL, SmearGate, SWA,
cross-doc TTT), which achieves 1.1839 BPB. Cross-doc TTT is identified as the
single biggest TTC win (a 0.035 BPB improvement).
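
The cross-doc TTT piece — rank-8 LoRA adapters fit per document at eval time — can be outlined as follows. The zero-init of B (so the adapter is a no-op before any per-document steps) is the standard LoRA convention; everything else here is an illustrative assumption:

```python
import numpy as np

RANK = 8  # rank-8 adapters, per the PR description

def init_lora(d_in, d_out, rank=RANK, seed=0):
    """Standard LoRA init: A random, B zero, so W + A @ B == W at start."""
    rng = np.random.default_rng(seed)
    A = rng.normal(scale=0.02, size=(d_in, rank))
    B = np.zeros((rank, d_out))
    return A, B

def lora_forward(x, W, A, B):
    """Frozen base weight W plus the low-rank per-document update A @ B."""
    return x @ W + (x @ A) @ B
```

At eval time only A and B would be updated on each document, leaving the quantized base weights untouched.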

https://claude.ai/code/session_01M5XTtyz2Zdq5BDeh9qNn9y
ThomAub pushed a commit to ThomAub/parameter-golf that referenced this pull request Mar 22, 2026
… budgets

Side-by-side comparison of 4 architectures:
- Baseline dense (17.1M, 1.224 BPB)
- Enhanced dense with PR#180/openai#457 techniques (~20.3M)
- Zero-cost MoE (same params, fewer FLOPs)
- Expanded MoE (34M params via int5/int6 compression)

Includes ASCII architecture diagrams, per-component parameter budgets,
quantization byte accounting, and step speed estimates.

https://claude.ai/code/session_01M5XTtyz2Zdq5BDeh9qNn9y
