Submission: 11L NTK-RoPE + FA3 + Batch524K + XSA4 + EMA (val_bpb=1.1328)#369

Closed
signalrush wants to merge 2 commits into openai:main from signalrush:submission/ntk-rope-fa3-batch524k

Conversation

@signalrush
Contributor

Submission: 11L NTK-RoPE + FA3 + Batch524K + XSA4 + EMA

val_bpb: 1.1328 (sliding window stride=64, 3-seed mean) | 15.87 MB (mean) | 8xH100 SXM, 600s

Key Techniques

| Change | Impact |
| --- | --- |
| NTK-aware RoPE | Auto-scales RoPE base when seq_len > train_seq_len=1024; ~4x base scaling at seq=2048 |
| FlashAttention 3 (Hopper) | 58ms/step vs 99ms SDPA — 71% more training steps (10,300+ vs 6,030) |
| Batch=524K | More gradient updates per 600s budget |
| Adaptive pruning | Auto-finds minimal prune % to fit 16MB per seed |
| XSA on last 4 layers | Exclusive Self Attention (arXiv:2603.09078) |
| EMA (decay=0.997) | Smooth weight averaging every step |
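The NTK-aware RoPE row describes rescaling the rotary base once the context exceeds the 1024-token training length. A minimal sketch, assuming the common NTK-aware exponent `d / (d - 2)` over the extension ratio (the PR does not show its exact variant, and the helper name and head_dim=64 default are illustrative):

```python
def ntk_rope_base(seq_len: int, train_seq_len: int = 1024,
                  base: float = 10000.0, head_dim: int = 64) -> float:
    """NTK-aware RoPE base rescaling (hypothetical helper).

    Leaves the base untouched at or below the training length; beyond
    it, scales the base by s ** (d / (d - 2)), where s is the
    context-extension ratio. This is the common NTK-aware derivation;
    the PR's exact scaling rule may differ.
    """
    if seq_len <= train_seq_len:
        return base
    s = seq_len / train_seq_len
    return base * s ** (head_dim / (head_dim - 2))
```

With head_dim=64 this gives roughly a 2x base increase at seq=2048, so the PR's reported ~4x scaling implies a more aggressive variant than this textbook form.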

Results (3 seeds, 8xH100 SXM)

| Seed | Steps | Sliding BPB (s64) | Artifact | Prune % |
| --- | --- | --- | --- | --- |
| 42 | 10,368 | 1.1344 | 15.89 MB | 14% |
| 1337 | 10,346 | 1.1320 | 15.89 MB | 10% |
| 2024 | 10,358 | 1.1319 | 15.83 MB | 12% |

Mean: 1.1328 | Std: 0.0014 | Submitted: seed 2024 (best)
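The reported summary statistics can be reproduced from the three per-seed BPBs; the 0.0014 figure matches the sample standard deviation (n−1 denominator):

```python
import statistics

# Per-seed sliding-window BPBs from the results table (seeds 42, 1337, 2024)
bpbs = [1.1344, 1.1320, 1.1319]

mean = statistics.mean(bpbs)
std = statistics.stdev(bpbs)  # sample std (n-1), matching the reported value

print(round(mean, 4), round(std, 4))  # 1.1328 0.0014
```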

Architecture

  • 11L, 512d, 8H/4KV, MLP 3x (relu²), U-Net skips
  • SmearGate + BigramHash(4096, dim=128) + OrthoInit + muP
  • NTK-aware RoPE, logit softcap=30
  • Muon (lr=0.025, mom=0.99) + AdamW (tied_embed_lr=0.035)
  • WD=0.04, warmdown=3000, grad_clip=0.3
  • Int5 MLP / Int6 attn / Int8 embed + zstd-22
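The EMA entry (decay=0.997, updated every step) amounts to keeping a shadow copy of the weights. A framework-free sketch, with plain floats standing in for tensors (class and parameter names are illustrative, not the PR's code):

```python
class EMA:
    """Exponential moving average of model weights (decay=0.997).

    After each optimizer step: shadow <- decay * shadow + (1 - decay) * current.
    Evaluation then uses the shadow weights instead of the raw ones.
    """

    def __init__(self, params: dict, decay: float = 0.997):
        self.decay = decay
        self.shadow = dict(params)  # initialize shadow to current weights

    def update(self, params: dict) -> None:
        d = self.decay
        for name, value in params.items():
            self.shadow[name] = d * self.shadow[name] + (1.0 - d) * value
```

With decay=0.997 the shadow averages over roughly the last few hundred steps, which smooths noise late in a short 600s run.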

Run Command

```
SEED=2024 bash eval/eval.sh
```

FA3 Installation

Built from Dao-AILab/flash-attention hopper/ subdirectory with:

```
FLASH_ATTENTION_DISABLE_SM80=TRUE FLASH_ATTENTION_DISABLE_FP16=TRUE \
FLASH_ATTENTION_DISABLE_FP8=TRUE TORCH_CUDA_ARCH_LIST="9.0a" \
python setup.py install
```

Test plan

  • All 3 seeds under 16MB artifact limit
  • All 3 seeds train in 600s on 8xH100
  • Post-quant roundtrip verified per seed
  • Sliding window eval (stride=64) consistent across seeds
  • train_gpt.py under 1500 lines (1421)
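The stride-64 sliding-window eval scores each token exactly once while giving later windows up to a full window of left context. A sketch of the window indexing only (the window size is assumed to equal train_seq_len=1024, and the scoring loop itself is omitted):

```python
def sliding_window_spans(n_tokens: int, window: int = 1024, stride: int = 64):
    """Return (begin, end, n_new) spans for sliding-window eval.

    Each window covers tokens [begin, end); only its last n_new tokens
    are scored, so every token is scored exactly once while later
    windows see up to `window - stride` extra tokens of left context.
    """
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

The small stride is what makes this eval expensive: all but the first window contribute only 64 scored tokens each, in exchange for near-full context at every position.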

🤖 Generated with Claude Code

@signalrush signalrush closed this Mar 25, 2026
haikosys pushed a commit to haikosys/parameter-golf that referenced this pull request Mar 30, 2026
…nai#400 openai#369 openai#398)

KEY DISCOVERY: PR#414 stacks EMA + Tight SWA together (-0.0006 BPB free)
GPTQ should be per-ROW not per-matrix (-0.0006 BPB)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
