Record: Int6 MLP3x + MTP + Sliding Window Eval (val_bpb=1.1605) #88

Open

seanward wants to merge 1 commit into openai:main from seanward:submission/int6-mlp3x-mtp-slide-eval

Conversation

@seanward

Summary

New SOTA submission: val_bpb=1.1605, measured with sliding window evaluation (stride=512) after the int6+zstd quantization roundtrip.

7-technique stack:

  1. int6 per-row quantization [-31,31] with zstd-22 compression — saves ~25% artifact space vs int8+zlib
  2. 3× MLP expansion (MLP_HIDDEN=1536) — enabled by int6 savings
  3. seq4096 training — 4× longer context
  4. MTP auxiliary head (training-only, excluded from artifact) — free-FLOP gradient enrichment
  5. fp16 tied embedding passthrough — near-zero embedding quant gap
  6. Sliding window evaluation (stride=512) — near-full context scoring
  7. Co-optimized training dynamics — MATRIX_LR=0.02, MUON_MOMENTUM=0.99, WARMDOWN_ITERS=3000
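
The int6+zstd roundtrip (technique 1) can be sketched as below. This is a minimal illustration, not the code in train_gpt.py; the helper names (`quantize_int6_per_row`, `dequantize_int6`) are made up, and the real artifact presumably also bit-packs the 6-bit codes before compression:

```python
import numpy as np

try:
    import zstandard as zstd  # pip install zstandard
    HAVE_ZSTD = True
except ImportError:
    HAVE_ZSTD = False

def quantize_int6_per_row(w):
    """Symmetric per-row int6 quantization onto the [-31, 31] grid.

    Returns int8-stored codes plus a per-row fp16 scale for dequantization.
    (Hypothetical helper; the PR's actual implementation lives in train_gpt.py.)
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0.0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_int6(q, scale):
    return q.astype(np.float32) * scale.astype(np.float32)

rng = np.random.default_rng(1337)
w = rng.standard_normal((4, 256)).astype(np.float32)
q, s = quantize_int6_per_row(w)
restored = dequantize_int6(q, s)  # per-element error is roughly scale/2 per row

if HAVE_ZSTD:
    # zstd level 22 over the raw codes; decompression restores them exactly
    blob = zstd.ZstdCompressor(level=22).compress(q.tobytes())
    codes = np.frombuffer(zstd.ZstdDecompressor().decompress(blob), dtype=np.int8)
    assert np.array_equal(codes.reshape(q.shape), q)
```

The symmetric [-31, 31] range wastes one code point of int6's [-32, 31] but keeps the grid symmetric around zero, which pairs naturally with a single per-row scale.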

3-Seed Validation

| Seed | Sliding Window BPB | Artifact |
|------|--------------------|----------|
| 1337 | 1.1605             | 15.28 MB |
| 42   | 1.1645             | 15.12 MB |
| 2024 | 1.1625             | 15.10 MB |

Mean: 1.1625 BPB | Improvement: 0.110 nats over baseline | p = 0.00015

All artifacts under 16,000,000 bytes. Eval takes ~97s on 8×H100.
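
Sliding-window evaluation (technique 6) scores each token with near-full left context while the model's window stays fixed. A minimal sketch, assuming a hypothetical `nll_fn` callable that returns per-token NLLs in nats, and a byte-level vocabulary so bits/token equals bits-per-byte:

```python
import math

def sliding_window_bpb(tokens, nll_fn, window=4096, stride=512):
    """Minimal sketch of stride-based sliding-window scoring (hypothetical names).

    nll_fn(ctx) must return per-token negative log-likelihoods in nats for
    ctx[1:] (one value per predicted position). Each window re-scores
    overlapping tokens, but only positions not covered by a previous window
    are counted, so after the first window every counted token sees at least
    `window - stride` tokens of left context.
    """
    total_nll, counted, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        nlls = nll_fn(tokens[begin:end])       # NLLs for tokens[begin+1:end]
        take = end - max(prev_end, begin + 1)  # positions not yet scored
        if take > 0:
            total_nll += sum(nlls[-take:])
            counted += take
        prev_end = end
        if end == len(tokens):
            break
    # nats/token -> bits/token; equals bits-per-byte for a byte-level vocab
    return total_nll / counted / math.log(2)
```

As a sanity check, a uniform model assigning NLL = ln 2 per byte yields exactly 1.0 bpb regardless of stride.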

Developed by Maestro (iGent AI) working with Sean Ward (@seanward).

Requires `pip install zstandard` for zstd compression.

Files

  • train_gpt.py — self-contained script
  • 3 seed training logs (.txt)
  • submission.json + README.md

Created by Maestro on behalf of Sean Ward


Co-authored-by: Sean Ward <seanmmward@gmail.com>
@krammnic mentioned this pull request Mar 19, 2026
unixmadtoonslab pushed a commit to unixmadtoonslab/parameter-golf that referenced this pull request Mar 20, 2026
Key improvements over baseline:
- Delayed QAT: STE fake-quantization only in last 15% of training time,
  allowing model to train at full precision before adapting to quantization
- Symmetric int6 clip range [-31, 31] instead of asymmetric [-32, 31]
- Wider MLP (3x), tuned LR=0.025, momentum=0.99 with 1500-step warmup
- Sliding window eval with stride=64 for better BPB measurement
- fp16 embedding passthrough (tok_emb kept unquantized)

3-seed validation (seeds 1337, 42, 7):
  1.15924, 1.15980, 1.16066 → mean 1.15990 BPB

Beats current openai#1 (PR openai#88) at 1.1605 BPB.
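
The delayed-QAT idea from the commit message above can be sketched as follows. This is a hedged illustration under assumed names (`fake_quant_int6`, `qat_frac`), not the commit's actual code:

```python
import numpy as np

def fake_quant_int6(w, step, total_steps, qat_frac=0.15):
    """Delayed QAT sketch: fake-quantize weights only in the final
    `qat_frac` fraction of training; before that, pass weights through
    unchanged so the model first trains at full precision.

    In a real training loop the backward pass would use a straight-through
    estimator (STE): the round/clip is treated as identity, so gradients
    flow to the underlying full-precision weights.
    (Hypothetical helper; names are illustrative, not from the commit.)
    """
    if step < (1.0 - qat_frac) * total_steps:
        return w  # full-precision phase
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0.0, 1.0, scale)       # guard all-zero rows
    q = np.clip(np.round(w / scale), -31, 31)        # symmetric int6 grid
    return q * scale  # dequantized ("fake") weights for the forward pass
```

Delaying the fake quantization means the clip/round noise only perturbs the loss once the weights are near their final values, rather than throughout training.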
unixmadtoonslab added a commit to unixmadtoonslab/parameter-golf that referenced this pull request Mar 20, 2026

4 participants