
Record: 9L XSA-all + LeakyReLU² + 5-gram eval cache — val_bpb 1.0909 (3-seed mean)#740

Closed
resouer wants to merge 2 commits into openai:main from resouer:submission/9L-XSA-ngram

Conversation


@resouer resouer commented Mar 25, 2026

Summary

3-seed mean val_bpb: 1.0909 (std=0.0011) | 14.7 MB | 8xH100 SXM

Results

Seed    Pre-ngram BPB   Post-ngram BPB        Artifact
1337    1.1700          1.0898                14.68 MB
42      1.1701          1.0909                14.69 MB
7       1.1700          1.0920                14.68 MB
Mean    1.1700          1.0909 (std 0.0011)

Key Techniques

Training (9L/512d, 17.6M params)

  • 9L transformer, 512d, 8H/4KV GQA, MLP 2x, LeakyReLU(0.5)²
  • XSA (Exclusive Self-Attention) on all 9 layers
  • SmearGate, BigramHash(4096), OrthoInit, LN Scale, Partial RoPE (25%)
  • Muon optimizer, seq2048, batch 786K, warmdown 3500
  • Int8 per-row quantization + zstd-22 (near-zero quant degradation)
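The int8 per-row step can be sketched as follows. This is a minimal pure-Python illustration of symmetric per-row quantization (one float scale per weight-matrix row), not the PR's actual code; `zlib` level 9 stands in for the zstd-22 stage, and the function names are hypothetical.

```python
import zlib

def quantize_row(row):
    """Symmetric int8 quantization: one scale per row, values clipped to [-127, 127]."""
    scale = max(abs(x) for x in row) / 127.0 or 1.0  # guard all-zero rows
    q = [max(-127, min(127, round(x / scale))) for x in row]
    return q, scale

def dequantize_row(q, scale):
    return [v * scale for v in q]

row = [0.5, -1.2, 0.03, 2.54, -0.77]
q, scale = quantize_row(row)
restored = dequantize_row(q, scale)
err = max(abs(a - b) for a, b in zip(row, restored))  # worst case is scale / 2

# Pack the int8 codes to bytes and compress (zlib as a stand-in for zstd-22).
blob = zlib.compress(bytes(v & 0xFF for v in q), 9)
print(err < scale)  # → True
```

Because the round-trip error is bounded by half a quantization step per row, the "near-zero quant degradation" claim is plausible for weights whose rows have comparable dynamic range.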

Eval: Online 5-gram Cache (-0.079 BPB)

  • Hashed 5-gram frequency table (4M buckets) from scored tokens
  • Fixed-weight linear mixing: mixed = 0.8 * p_model + 0.2 * p_ngram
  • Score-first, backward-looking, no target-aware gating
  • 132s eval time (well within 600s budget)
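The eval-cache mechanics above can be sketched like this. A hedged toy version, assuming vocab-sized probability vectors: the bucket count, hash, and names (`observe`, `mix`) are illustrative, not the PR's exact implementation.

```python
NUM_BUCKETS = 1 << 22   # stand-in for the 4M-bucket table
ALPHA = 0.2             # fixed mixing weight for the n-gram distribution
N = 5                   # 5-grams: context is the previous N-1 = 4 tokens
VOCAB = 256             # toy vocabulary size

counts = {}             # sparse stand-in: bucket -> {token: count}

def bucket(context):
    return hash(context) % NUM_BUCKETS

def observe(context, token):
    """Score-first: called only after `token` has already been scored."""
    hist = counts.setdefault(bucket(context), {})
    hist[token] = hist.get(token, 0) + 1

def mix(p_model, context):
    """mixed = 0.8 * p_model + 0.2 * p_ngram, falling back to the model alone."""
    hist = counts.get(bucket(context))
    if not hist:
        return p_model
    total = sum(hist.values())
    return [(1 - ALPHA) * p + ALPHA * hist.get(t, 0) / total
            for t, p in enumerate(p_model)]

ctx = ("the", "quick", "brown", "fox")  # backward-looking 4-token context
observe(ctx, 42); observe(ctx, 42); observe(ctx, 7)
p_model = [1.0 / VOCAB] * VOCAB
mixed = mix(p_model, ctx)
print(mixed[42] > mixed[7] > mixed[0])  # → True
```

Since the table only ever sees tokens after they are scored and the mixing weight is fixed, there is no target-aware gating: the cache cannot peek at the label it is about to be evaluated on.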

Reproduce

SEED=1337 NUM_LAYERS=9 MLP_MULT=2 QUANT_BITS=8 GPTQ_ENABLED=0 PRUNE_PCT=0 NGRAM_ENABLED=1 \
  torchrun --nproc_per_node=8 train_gpt.py

Credits

N-gram eval cache concept: @deanbrr (PR #659), @newjordan (PR #674)

…(3-seed)

3-seed validation:
  seed 1337: val_bpb=1.0898, 14.68 MB
  seed 42:   val_bpb=1.0909, 14.69 MB
  seed 7:    val_bpb=1.0920, 14.68 MB
  mean:      1.0909 (std=0.0011)

Architecture: 9L/512d, XSA-all, LeakyReLU(0.5)², SmearGate, BigramHash(4096),
OrthoInit, LN Scale, Partial RoPE. Int8 quantization + zstd-22.
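One plausible reading of the LeakyReLU(0.5)² activation, sketched below: square the output of a LeakyReLU with negative slope 0.5, analogous to the squared-ReLU activation used in earlier speedrun records. This is an assumption about the notation, not confirmed by the PR's code.

```python
def leaky_relu_sq(x: float, slope: float = 0.5) -> float:
    """Squared LeakyReLU: y = LeakyReLU(x, slope); return y * y."""
    y = x if x >= 0 else slope * x
    return y * y

print(leaky_relu_sq(2.0), leaky_relu_sq(-2.0))  # → 4.0 1.0
```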

Key technique: Online hashed 5-gram eval cache with fixed-weight linear mixing
(alpha=0.20). Gives -0.079 BPB improvement at eval time. 132s eval time.

Training: 8xH100 SXM, 600s wallclock, ~6900 steps at 87ms/step.
Improved from 1.0909 to 1.0238 with:
- Multi-order backoff (orders 2-7) with separate per-order hash tables
- Entropy-adaptive alpha: 0.05 + 0.55*sigmoid(2*(H-4.0))
- Alpha=0.40 base (was 0.20)
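The entropy-adaptive alpha stated above can be written out directly; here H is assumed to be the model's predictive entropy in bits at the current position, and the function name is hypothetical.

```python
import math

def adaptive_alpha(entropy_bits: float) -> float:
    """alpha(H) = 0.05 + 0.55 * sigmoid(2 * (H - 4.0))."""
    sig = 1.0 / (1.0 + math.exp(-2.0 * (entropy_bits - 4.0)))
    return 0.05 + 0.55 * sig
```

The shape is intuitive: at low entropy (a confident model) alpha approaches 0.05 and the mix leans on the model; at high entropy it approaches 0.60 and leans on the n-gram tables; at exactly H = 4 bits it sits at the midpoint, 0.325.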

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@resouer resouer closed this Apr 1, 2026