
feat(record): Int6 STE + NorMuon + SWA + Sliding Window (val_bpb=1.16019) #156

Open

dexhunter wants to merge 2 commits into openai:main from dexhunter:weco-int6-ste-normuon

Conversation

@dexhunter

Summary

New SOTA submission: mean val_bpb = 1.16019 across 3 seeds, beating current merged SOTA (1.17475) by 0.01456.

| Seed | val_bpb | Steps | ms/step | Artifact (bytes) |
| --- | --- | --- | --- | --- |
| 1337 | 1.16146 | 12357 | 48.55 | 15,045,740 |
| 42 | 1.15935 | 12351 | 48.58 | 15,053,489 |
| 7 | 1.15976 | 12336 | 48.69 | 15,157,415 |
| **Mean** | **1.16019** | | | |
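The headline numbers can be re-derived directly from the table (a trivial recomputation, nothing here beyond the figures above):

```python
# Per-seed val_bpb from the results table above.
per_seed = {1337: 1.16146, 42: 1.15935, 7: 1.15976}

mean = sum(per_seed.values()) / len(per_seed)
prev_sota = 1.17475  # current merged SOTA cited in the summary

assert round(mean, 5) == 1.16019            # reported mean
assert round(prev_sota - mean, 5) == 0.01456  # reported margin
```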

Key Techniques

  1. Int6 STE — Fake int6 per-row quantization ([-31, 31]) applied on every forward pass, with a straight-through estimator passing gradients through the rounding unchanged. The model learns to tolerate the quantization noise, leaving a quantization gap of only ~0.002 bpb.
  2. NorMuon optimizer — Muon's Newton-Schulz orthogonalized updates with row-wise normalization and adaptive step sizes.
  3. 3x MLP width (1536) — Extra capacity afforded by int6 compression savings while staying within the 16 MB budget.
  4. FP16 tied embedding — The tied embedding tensor is stored in fp16 and never quantized.
  5. Sliding window eval (stride=64) — Every scored token sees 960 tokens of preceding context (~0.033 bpb improvement).
  6. SWA — Average of 7 checkpoints taken during the warmdown phase.
  7. Zstd-22 compression — Better compression ratio than zlib for the quantized weights.
  8. U-Net skip connections — Encoder-decoder structure with learnable skip weights.
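The Int6 STE in technique 1 can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the PR's actual implementation: the function name `int6_ste` and the `1e-8` clamp guarding all-zero rows are mine.

```python
import torch

def int6_ste(w: torch.Tensor) -> torch.Tensor:
    # Per-row scale: map each row's max magnitude onto the int6 grid edge (31).
    scale = w.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / 31.0
    # Fake-quantize: round onto the integer grid [-31, 31], then rescale.
    q = torch.clamp(torch.round(w / scale), -31, 31) * scale
    # Straight-through estimator: forward pass uses q,
    # backward pass sees the identity (gradient bypasses the rounding).
    return w + (q - w).detach()
```

Calling this on the weights in every forward pass is what lets the model adapt to the quantization noise during training, so the final int6 artifact loses almost nothing.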
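The SWA step in technique 6 amounts to a uniform mean over checkpoint state dicts. A minimal sketch, assuming all checkpoints share identical keys (the helper name `average_checkpoints` is mine; in the actual run, 7 warmdown checkpoints are averaged):

```python
def average_checkpoints(state_dicts):
    """Uniform average of parameters across checkpoints (SWA).

    Works for any values supporting + and / (floats, numpy arrays,
    torch tensors); assumes every checkpoint has identical keys.
    """
    n = len(state_dicts)
    return {k: sum(sd[k] for sd in state_dicts) / n for k in state_dicts[0]}
```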

Architecture

  • 9 layers, 512 dim, 8 heads, 4 KV heads (GQA)
  • Vocab 1024 (SentencePiece BPE), seq len 1024
  • relu² activation, RoPE, logit softcapping (30.0)
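The sliding-window eval (technique 5) follows from the 1024-token sequence length above: windows advance by stride=64 and loss is computed only on each window's last 64 positions, so every scored token after the first window has 1024 − 64 = 960 tokens of left context. A sketch of the window schedule; the function name and edge handling are my assumptions, and it assumes the token count aligns with the stride:

```python
def sliding_windows(n_tokens: int, seq_len: int = 1024, stride: int = 64):
    """Return (start, end, score_from) triples covering n_tokens.

    Each window spans [start, end); loss is computed only on positions
    [score_from, end), so consecutive windows score disjoint spans and
    every scored token after the first window has seq_len - stride
    tokens of left context.
    """
    windows = []
    for start in range(0, n_tokens - seq_len + 1, stride):
        end = start + seq_len
        score_from = start if start == 0 else end - stride
        windows.append((start, end, score_from))
    return windows
```

The first window scores all of its positions (there is no earlier context to give up), which is why the quoted 960-token guarantee applies to every window after it.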

Submission checklist

  • 3-seed verification with mean val_bpb
  • All artifacts < 16MB
  • Wallclock < 600s on 8xH100
  • Train logs included
  • Reproducible train_gpt.py included
  • submission.json with metadata

…019)

3-seed verified results:
- Seed 1337: val_bpb=1.16146
- Seed 42: val_bpb=1.15935
- Seed 7: val_bpb=1.15976
- Mean: 1.16019

Key techniques: int6 STE quantization-aware training, NorMuon optimizer,
3x MLP width (1536), FP16 tied embedding, sliding window eval (stride=64),
SWA with 7 checkpoints, zstd-22 compression, U-Net skip connections.
Added val_bpb, bytes_total, bytes_code, github_id fields expected
by the parameter-golf-leaderboard collector script.
