Entry for the OpenAI Parameter Golf competition: train the best language model that fits in 16MB and trains in 10 minutes on 8xH100 SXM, scored by bits-per-byte (bpb) on the FineWeb validation set.
Submission: PR #379
Built on PR #374's SOTA stack (1.1246 bpb) with a novel post-training quantization improvement.
Standard int6 quantization uses row-wise absolute max for clipping. GPTQ-lite searches 5 clip percentiles per weight matrix (100%, 99.9%, 99.5%, 99%, 98%) and selects the one minimizing reconstruction error ||W - dequant(quant(W))||. This reduces quantization degradation at zero training cost. Zero new parameters, ~10s extra quantization time.
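The clip search can be sketched in a few lines. This is a minimal NumPy sketch of the idea, not the competition code: the helper names (`quantize_int6`, `gptq_lite_clip_search`) and the symmetric ±31 int6 range are assumptions for illustration.

```python
import numpy as np

def quantize_int6(w, clip):
    """Row-wise symmetric int6 quantization at the given per-row clip values."""
    scale = np.maximum(clip, 1e-8) / 31.0      # assumed symmetric range [-31, 31]
    q = np.clip(np.round(w / scale), -31, 31)  # quantize
    return q * scale                           # dequantize (reconstruction)

def gptq_lite_clip_search(w, percentiles=(100.0, 99.9, 99.5, 99.0, 98.0)):
    """Pick the clip percentile minimizing ||W - dequant(quant(W))||_F."""
    absw = np.abs(w)
    best = None
    for p in percentiles:
        clip = np.percentile(absw, p, axis=1, keepdims=True)  # row-wise clip
        w_hat = quantize_int6(w, clip)
        err = np.linalg.norm(w - w_hat)
        if best is None or err < best[0]:
            best = (err, p, w_hat)
    return best  # (reconstruction error, chosen percentile, reconstructed W)
```

Because 100% (plain abs-max) is in the candidate set, the search can only match or improve on standard row-wise quantization for every weight matrix.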
| Technique | Source | Effect |
|---|---|---|
| 11 layers, dim=512, MLP3x | Community consensus | More depth + wider MLP |
| Int6 QAT with STE | PR #135 | Better compression, MLP3x fits in 16MB |
| Tight SWA (scale<0.2, every 50 steps) | PR #374 | Zero-penalty weight averaging |
| XSA (Exclusive Self Attention) on last 4 layers | PR #287 | Removes self-value redundancy |
| Partial RoPE (16/64 dims) | PR #315 | Position-free attention on 75% of dims |
| LN Scale (1/sqrt(layer+1)) | PR #315 | Dampens deeper layers |
| Late QAT (enable at lr_scale < 0.1) | PR #315 | 3x less quantization degradation |
| Value Embedding (shared, layers 9-10) | PR #374 | Re-injects token identity in deep layers |
| zstd-22 compression | PR #86 | ~15-20% smaller artifact than zlib-9 |
| Muon optimizer, WD=0.04 | Community consensus | Better weight distributions |
| SmearGate + BigramHash(2048) | PR #135 | Token blending + bigram context |
| FA3 (FlashAttention 3, Hopper) | Built from source | ~85ms/step on H100 |
| Sliding window eval (stride=64) | PR #56 | Better context per scored token |
| GPTQ-lite clip search | Novel | Per-layer optimal quantization |
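To make one of the table rows concrete, partial RoPE (16/64 dims) rotates only the first slice of each head dimension and leaves the rest position-free. The sketch below is illustrative, assuming the common half/half pairing convention within the rotary slice; `partial_rope` is a hypothetical helper, not the actual training-script function.

```python
import numpy as np

def partial_rope(q, theta=10000.0, rope_dims=16):
    """Apply rotary embeddings to the first `rope_dims` of head_dim;
    the remaining dims stay position-free. q: (seq, head_dim)."""
    seq, head_dim = q.shape
    half = rope_dims // 2
    inv_freq = theta ** (-np.arange(half) / half)   # per-pair frequencies
    ang = np.arange(seq)[:, None] * inv_freq        # (seq, half) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = q[:, :half], q[:, half:rope_dims]      # paired rotary coordinates
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, q[:, rope_dims:]], axis=1)
```

The rotation is norm-preserving and position 0 is left unchanged, so only relative position information enters the rotated 16 dims.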
All experiments ran on 8xH100 SXM via Modal. bpb is reported with sliding-window eval (stride=64).
| Run | Config | bpb | Steps | ms/step | Notes |
|---|---|---|---|---|---|
| baseline | 9L vanilla | 1.2272 | 12351 | ~49 | Starting point |
| int6_mlp3x | 9L + int6/QAT/MLP3x | 1.1657 | 8325 | ~72 | First competitive result |
| resformer_off (NGC 25.01) | PR#287 base, SDPA fallback | 1.1362 | 5975 | 100 | FA3 unavailable |
| resformer_off (NGC 25.03) | PR#287 base, FA2 | 1.1315 | 6060 | 99 | Better torch.compile |
| diffattn (FA2) | PR#315 + DiffAttn, 4 FA calls | 1.1552 | 3882 | 154 | Per-step quality great, too slow |
| diffattn (FA3) | PR#315 + DiffAttn, FA3 | 1.2312 | 1863 | 322 | FA3 kernel overhead worse |
| sdttt_gptq | PR#374 + SDTTT + GPTQ, FA3 | 1.1260 | 6701 | 89.6 | SDTTT hurt (-0.0003), GPTQ helped |
| gptq_only | PR#374 + GPTQ, FA3 | 1.1257 | 6733 | 89.1 | Best result |
- GPTQ-lite: Per-layer clip percentile search reduces quantization degradation. Simple, fast, effective.
- FA3 from source: Built `flash_attn_interface` from the `hopper/` subdirectory of Dao-AILab/flash-attention. Required for competitive step times.
- Leaderboard scraping: Built `leaderboard.py` to track 200+ PRs and extract technique frequency. Critical for strategy.
- Trusting community data: Other participants' validated results (3-seed) are reliable data points. Don't re-run what's already proven.
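The frequency-analysis part of that scraping can be sketched as below. The keyword list and the `technique_frequency` helper are illustrative assumptions, not the actual `leaderboard.py` internals.

```python
from collections import Counter
import re

# Hypothetical sketch: given scraped PR titles, count how many PRs
# mention each known technique keyword.
TECHNIQUES = ("int6", "qat", "swa", "rope", "muon", "fa3", "gptq")

def technique_frequency(pr_titles):
    counts = Counter()
    for title in pr_titles:
        tokens = set(re.findall(r"[a-z0-9]+", title.lower()))
        for tech in TECHNIQUES:
            if tech in tokens:
                counts[tech] += 1
    return counts
```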
- Differential Attention (4-call): 0.046 bpb better per step, but 4 FA calls = 154-322ms/step. Net negative. The architecture is right but the implementation is too expensive.
- Even/odd head differential: Zero-cost DiffAttn variant. Crashed 3 times with FA3 + torch.compile + DDP incompatibilities.
- Self-distillation TTT: KL-divergence adaptation post-training. Slightly negative (-0.0003 bpb). The KL constraint was too weak to improve on the already well-trained model.
- ResFormer value residual: Zero params, zero cost, but neutral at 26M scale. May help at larger scales.
- Depth recurrence: 50+ competition attempts by various participants, best result 0.044 behind SOTA. Unique layers beat shared layers at 16MB.
- NorMuon: Training instability without warmup. Noisy initial EMA + weight decay = destructive feedback loop.
- Too many changes at once: Stacking 6 changes simultaneously regressed badly. Always bisect.
- Quantization is the biggest untapped gap: SOTA loses 0.007 bpb to int6 quantization (~30% of total improvement over baseline). Most competition focus is on training quality.
- Eval-time compute is unconstrained: The 10-min limit applies to training only. Post-training optimizations (GPTQ, TTT, SWA averaging) have unlimited time.
- FA3 vs FA2 matters: ~15% step time difference = ~0.005 bpb from more training steps alone.
- The competition converges fast: With 200+ participants, simple technique combinations are found within hours. Novel contributions need to be non-obvious.
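The arithmetic behind the FA3-vs-FA2 lesson can be checked directly from the results table; `steps_in_budget` is an illustrative helper, and the ms/step figures are the table's measured values.

```python
# Back-of-envelope: a step-time gap at a fixed 10-minute budget
# translates directly into extra optimizer steps.
BUDGET_S = 10 * 60                    # training wall-clock budget, seconds

def steps_in_budget(ms_per_step):
    return int(BUDGET_S * 1000 / ms_per_step)

fa3_steps = steps_in_budget(89.1)     # gptq_only run (FA3)
fa2_steps = steps_in_budget(99.0)     # resformer_off run (FA2)
extra = fa3_steps - fa2_steps         # several hundred extra training steps
```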
See RUN_GUIDE.md for running on your own 8xH100 node.
```bash
# Clone and run
git clone https://github.com/dannywillowliu-uchi/parameter-golf-entry.git
cd parameter-golf-entry
GPTQ_ENABLED=1 torchrun --standalone --nproc_per_node=8 pr374_train_gpt.py
```

| File | Purpose |
|---|---|
| `pr374_train_gpt.py` | Main training script (PR #374 base + GPTQ-lite + SDTTT) |
| `sweep.py` | Experiment orchestration (configs, Modal submission, logging) |
| `modal_runner.py` | Modal cloud runner (NGC 25.03, FA3 build, 8xH100) |
| `leaderboard.py` | Live leaderboard scraper with technique frequency analysis |
| `results.jsonl` | All experiment results with full stdout/stderr |
| `train_gpt.py` | Our earlier modified training script (int6, QAT, kernels) |
| `kernels/` | Custom Triton kernels (fused RMSNorm, cross-entropy, FP8) |
| `RUN_GUIDE.md` | Setup guide for running on external GPU nodes |
- Competition: OpenAI Parameter Golf (March 18 - April 30, 2026)
- Constraint: 16MB artifact, 10 min on 8xH100 SXM
- Metric: Bits-per-byte (bpb) on FineWeb validation set
- Leaderboard: parameter-golf.github.io
- Current SOTA: 1.1246 bpb (PR #374)
- Our position: 1.1257 bpb, rank ~3 (PR #379)