
Record: val_bpb=1.1574 — Int6 + MLP 3x + selective precision + optimized long-context training#114

Open
saml212 wants to merge 1 commit into openai:main from saml212:sam/int6-mlp3x-slide

Conversation

Contributor

@saml212 saml212 commented Mar 19, 2026

val_bpb = 1.1574 — Top-ranked valid submission at time of posting

Baseline: 1.2244. Improvement: -0.067 BPB / -0.120 nats. 15.98MB artifact. 600s training + 240s sliding window eval on 8xH100 SXM.

This is the third PR in a series (see also PR #61 at 1.2154 and PR #96 at 1.1764), each building on systematic experimentation across ~60 runs.


What's Novel

1. Train@2048 matches train@4096 when using sliding window eval. I tested both extensively — training at 2048 gives identical BPB to training at 4096 (1.1764 vs 1.1765) when evaluated with sliding window. The window already provides long context at eval; the model just needs to learn local patterns. Training at 2048 is strictly better because it gets more optimization steps in the 10-minute budget. Other top submissions train at 1024 or 4096 — 2048 is the overlooked sweet spot.

2. GRAD_CLIP_NORM=0.3 stabilizes long-sequence training. I swept from 0.0 to 1.0 and found a narrow optimum at 0.3 for train@2048. The mechanism: longer sequences produce higher gradient variance, and tight clipping (0.3 vs the common 1.0) stabilizes without over-constraining. Full sweep data in experiment log.

3. Batch=786K is optimal for train@2048. Swept from 393K to 1M. The default 524K is suboptimal; 786K balances gradient noise against step count.

4. Selective precision preservation. Two techniques stacked: (a) FP16 tied embedding — the dual-role embedding/output matrix is disproportionately damaged by quantization, discovered via systematic warmdown-quantization analysis in PR #61; (b) Late-K passthrough — last 2 layers' key projections kept in fp16 for sharper late-layer attention.

5. Stride=256 beats stride=64. Counter to other submissions using stride=64, I found stride=256 gives slightly better BPB (1.1574 vs 1.1579) at a quarter of the eval time — diminishing returns from smaller strides.
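The sliding-window evaluation behind points 1 and 5 can be sketched in a few lines. This is an illustrative pure-Python layout of the windows, not the repo's actual eval loop: it assumes each window after the first scores only its trailing `stride` tokens, which is what guarantees the context floor quoted below.

```python
def sliding_window_spans(n_tokens, window=2048, stride=256):
    """Return (window_start, score_start, score_end) triples.

    The first window scores all of its tokens (no earlier context exists);
    every later window scores only its trailing <= stride tokens, so each
    of those tokens conditions on at least window - stride preceding tokens
    (1792 for window=2048, stride=256).
    """
    spans = [(0, 0, min(window, n_tokens))]
    pos = min(window, n_tokens)
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        spans.append((end - window, pos, end))
        pos = end
    return spans
```

Smaller strides buy more context per scored token but require proportionally more forward passes, which is why stride=256 lands on the better side of the time/quality trade-off here.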

Approach

Int6 post-training quantization (per-row scaling, ±31 range) compresses weight matrices ~25% smaller than int8, freeing artifact space to triple the MLP hidden dimension (1024 → 1536). 21.8M parameters in 15.98MB.
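A minimal sketch of the per-row int6 scheme described above (symmetric ±31 range, one floating-point scale per row). Illustrative only — the PR's actual `quantize_float_tensor` additionally clips outliers at a high percentile before computing the scale:

```python
# Per-row int6 post-training quantization sketch (assumption: plain
# abs-max scaling; the real implementation uses percentile clipping).
def quantize_row_int6(row):
    amax = max(abs(v) for v in row)
    scale = amax / 31.0 if amax > 0 else 1.0  # avoid div-by-zero on all-zero rows
    q = [max(-31, min(31, round(v / scale))) for v in row]
    return q, scale

def dequantize_row(q, scale):
    return [v * scale for v in q]
```

At 6 bits per weight plus a per-row scale, storage is roughly 25% smaller than int8, which is what funds the 3x MLP width within the artifact budget.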

| Component | Precision |
| --- | --- |
| Weight matrices (Q, K, V, O, MLP) | int6 (±31, 63 levels) |
| Tied embedding (`tok_emb.weight`) | fp16 |
| Last 2 layers' `c_k.weight` | fp16 |
| Control tensors (scales, mixes, gains) | fp16 |
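The table above amounts to a name-based precision policy. A hypothetical sketch — the parameter-name scheme (`blocks.N....`) and the depth of 12 layers are assumptions for illustration, not taken from the repo:

```python
def precision_for(name, n_layers=12):
    """Assign a storage precision to a parameter by name (illustrative)."""
    if name == "tok_emb.weight":
        return "fp16"                    # tied embedding/output matrix
    if name.endswith("c_k.weight"):
        layer = int(name.split(".")[1])  # assumes names like "blocks.11.c_k.weight"
        if layer >= n_layers - 2:
            return "fp16"                # late-K passthrough: last 2 layers
    if name.endswith(".weight"):
        return "int6"                    # bulk weight matrices
    return "fp16"                        # control tensors: scales, mixes, gains
```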

Training

```
TRAIN_SEQ_LEN=2048 TRAIN_BATCH_TOKENS=786432 MLP_HIDDEN=1536
MATRIX_LR=0.02 SCALAR_LR=0.02 TIED_EMBED_LR=0.03
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3000 GRAD_CLIP_NORM=0.3
```
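GRAD_CLIP_NORM=0.3 is standard global-norm clipping. A pure-Python sketch of the operation (mirroring what `torch.nn.utils.clip_grad_norm_` does across all parameter gradients, flattened here to a single list for illustration):

```python
import math

def clip_grad_norm(grads, max_norm=0.3):
    """Rescale grads so their global L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads, total  # also return the pre-clip norm for logging
```

With train@2048's higher gradient variance, a tight ceiling like 0.3 fires often enough to damp outlier steps without rescaling typical ones into uselessness — which is the mechanism claimed for the narrow optimum in the sweep.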

Evaluation

Sliding window with stride=256 at eval_seq_len=2048. Every scored token sees 1792+ tokens of context. 240s eval time on 8xH100 (well within the separate 10-min eval budget).

| Eval Method | val_bpb |
| --- | --- |
| Non-overlapping | 1.1792 |
| Sliding stride=256 | 1.1574 |

Progression

| PR | BPB | Key Advance |
| --- | --- | --- |
| #61 | 1.2154 | Warmdown-quantization discovery (WD=20000 reduces quant penalty 3x) |
| #96 | 1.1764 | Long-context training + sliding window + clip=0.3 + batch=786K |
| #114 | 1.1574 | Int6 + MLP 3x + selective precision |

Full experiment log with ~60 runs, sweep data, and negative results available in the repository.

@saml212 saml212 changed the title Int6 + MLP 3x + sliding window: val_bpb=1.1574 #1 on leaderboard: val_bpb=1.1574 — Int6 + MLP 3x + selective precision + optimized long-context training Mar 19, 2026
NotADevIAmaMeatPopsicle added a commit to NotADevIAmaMeatPopsicle/parameter-golf that referenced this pull request Mar 19, 2026
Root cause: quantize_float_tensor used raw abs().amax() for scale
computation. With only 31 int6 levels, outlier weights inflate the
scale and collapse most values to zero. Fixed by using percentile
clipping (99.99984th quantile) matching PR openai#114's proven approach.

Also fixed asymmetric range [-32,31] -> symmetric [-31,31].

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator

0hq commented Mar 19, 2026

Thanks for the great submission! Before I officially add to leaderboard, mind running again to verify that the 1.1574 result is within noise? 1-2 more runs that show the same result would be great.

NotADevIAmaMeatPopsicle added a commit to NotADevIAmaMeatPopsicle/parameter-golf that referenced this pull request Mar 19, 2026
- SWA during warmdown: average ~7 checkpoints in second half of
  warmdown for flatter loss landscape generalization (-0.003-0.005 BPB)
- GRAD_CLIP_NORM=0.3: stabilizes train@2048 (from PR openai#114 sweep)
- TRAIN_BATCH_TOKENS=786432: larger batch for seq2048 (from PR openai#114)
- EVAL_STRIDE=256: faster eval, equal or better BPB than stride=64

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor Author

saml212 commented Mar 20, 2026

@0hq totally fair, happy to rerun! My plane is taking off now, though, and RunPod doesn't have an 8xH100 pod available currently, so I won't get to this until late tonight.

Contributor Author

saml212 commented Mar 20, 2026

@0hq Verification complete. Reran seeds 1337/1338/1339 on a fresh pod with the official RunPod template (py3.12 + torch 2.9.1+cu128):

| Seed | Sliding BPB | Steps | ms/step |
| --- | --- | --- | --- |
| 1337 | 1.1599 | 7,280 | 81.6 |
| 1338 | 1.1558 | 7,341 | 81.7 |
| 1339 | 1.1565 | 7,339 | 81.8 |
| Mean | 1.1574 | 7,320 | 81.7 |

Mean matches the original submission exactly. Seed-to-seed variance (std=0.0022) is from training stochasticity — Muon's distributed gradient round-trip introduces small numerical differences across runs. The original 3 seeds on my dev pod had std=0.0001 which was unusually tight.

Note: first run on a fresh pod was 100ms/step (cold torch.compile cache). After cache warms up it stabilizes at ~81.7ms/step. The verification seeds above are all warm-cache runs. Verification logs pushed to the branch.

@saml212 saml212 force-pushed the sam/int6-mlp3x-slide branch from 6cc7150 to 2669909 Compare March 23, 2026 18:22
@saml212 saml212 changed the title #1 on leaderboard: val_bpb=1.1574 — Int6 + MLP 3x + selective precision + optimized long-context training record: val_bpb=1.1574 — Int6 + MLP 3x + selective precision + optimized long-context training Mar 23, 2026
@saml212 saml212 changed the title record: val_bpb=1.1574 — Int6 + MLP 3x + selective precision + optimized long-context training Record: val_bpb=1.1574 — Int6 + MLP 3x + selective precision + optimized long-context training Mar 23, 2026
Contributor Author

saml212 commented Mar 23, 2026

3-seed verified, diff cleaned up


2 participants